ABSTRACT


SPUR is a RISC-based multiprocessor workstation being designed to facilitate
parallel-processing research. Typically, RISC architectures achieve low performance
levels for floating-point intensive applications, as the multiple-cycle
instructions are not implemented in the hardware. In an attempt to raise these
performance levels, the SPUR system provides floating-point support through an extended
instruction set and a tightly-coupled floating-point coprocessor. This report describes
the implementation of the control unit for this floating-point coprocessor, including the
coprocessor interface, control PLA definitions, the finite state machine, the
cycle counter, the 4-stage load-store pipeline, and the random logic generated to drive the
datapath modules. Implementation techniques and trade-offs are discussed, including
design strategy, area and speed optimization, noise margin considerations, and delay
balancing of the datapath control signals for clock skew minimization. Simulation
results obtained using SPICE, CRYSTAL, and MOSSIM are presented. The chip is
implemented in 2-layer-metal 2µm CMOS technology using a four-phase non-
overlapping clock with a target cycle time of approximately 100ns - 140ns.
1.  INTRODUCTION


1.1.  SPUR  System  Overview 

SPUR (Symbolic Processing Using RISCs) is a shared-memory multiprocessor being designed here
at Berkeley [7] to apply the RISC (Reduced Instruction Set Computer) concept to a parallel-processing
workstation. The basic SPUR system consists of 6 to 12 identical processors, each with a custom 32-bit
central-processing unit (CPU), a 128K-byte instruction and data cache and controller, and a floating-point
coprocessor (see Figure 1-1). Each of the processors communicates through a global shared memory
which, along with a single shared bus, simplifies parallel programming by eliminating the problem of
specifying complex processor interconnections. Like its predecessors, the SPUR CPU follows the typical
RISC philosophy of (approximately) one-cycle pipelined execution, register-register operations with
load-store accesses to memory, hard-wired control, and a large register file with overlapping windows.


Figure  1-1  SPUR  System  With  Floating-Point  Coprocessor 

RISC evolved as a way to circumvent the problems inherent in microcoded control without
sacrificing speed, efficiency, and simplicity of design. In order to develop an efficient processor with
one-cycle execution, instructions are limited to register-register operations with a few simple addressing
modes. The most commonly executed operations are optimized and placed in the hardware, while less
frequently executed operations, or multiple-cycle instructions, are implemented using software routines.
Executing multiple-cycle floating-point instructions in software results in low performance for
floating-point intensive applications [15]. One goal of the SPUR project was to increase this performance
level by providing a floating-point coprocessor.

1.2.  Floating-Point  Coprocessor

The SPUR floating-point coprocessor implements the IEEE 754 standard without microcode by
executing the most common functions in hardware and trapping to software to handle less frequent operations
(such as transcendentals) and exceptions. The possible exceptions which can be detected [1] include
invalid operation, overflow and underflow, divide by zero, and inexact result due to rounding. The IEEE
standard defines six different data types [8]: normalized, denormalized, zero, affine infinity, quiet Not a
Number, and signalling Not a Number. The normalized and zero types can be implemented entirely in
hardware, while the other four types are handled at least partially in software.

The floating-point coprocessor is tightly coupled to the SPUR CPU [6], which means that the
presence of the coprocessor is transparent to the programmer, who sees only the extended instruction set. That
is, the coprocessor is a feature of the implementation of the SPUR system, not the architecture. If a
floating-point instruction is encountered and a coprocessor is present in the system, the instruction will be
executed by the coprocessor; otherwise it will be implemented in software. The CPU handles all of the
communication details at the hardware level through the coprocessor interface.

1.2.1.  Coprocessor  Interface 

In order to improve floating-point performance through the use of an external coprocessor, it is
essential that the overhead of the data transfer between the central processor and the coprocessor be
insignificant in relation to the speed-up provided by the coprocessor. Several features of the SPUR
coprocessor interface were designed to minimize this overhead [1]. In particular, the floating-point unit (FPU)
can operate concurrently with the CPU, FPU loads and stores can be executed concurrently with arithmetic
FPU operations, and a 64-bit data bus is provided to transfer data directly between the cache and the FPU
(under control of the CPU, which calculates the effective memory address).

Figure 1-2 Coprocessor Interface Communication


Figure 1-2 shows the interface signals used for communication between the CPU and the FPU,
including the three cache signals and data bus used in data transfer. Assuming no misses occur, the CPU
receives and decodes the next instruction from its on-chip instruction buffer. The CPU asserts the
fpuNewInstr signal if the decoded instruction is a floating-point load or store instruction, or if it is an
arithmetic floating-point instruction and the FPU is not currently busy executing an arithmetic operation. The
CPU stalls if an arithmetic instruction is decoded while the FPU is busy. The assertion of the fpuNewInstr
signal causes the FPU to latch the instruction opcode and register specifiers into the instruction register,
where they are again decoded and the instruction is executed. The addition of a 4-stage load-store pipeline
(see Section 2.4) and 15 dual-ported floating-point registers allows floating-point loads and stores to be
executed concurrently with arithmetic floating-point operations. The fpuSuspend signal allows the CPU to
suspend all FPU memory operations, and is an input to the FPU internal finite state machine.
Floating-point exceptions are detected and handled by the CPU when the fpu-EXCEPpin line is asserted. Also, as
the CPU fetches the instructions and maintains the program counter, the fpu-BR-TlFpin is needed to
indicate the result (and branch direction) of a floating-point compare operation.
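
The handshake described above can be summarized as a small decision function. The following is a minimal sketch rather than the actual CPU logic: the names `Kind` and `interface_signals` are hypothetical, and only the fpuNewInstr and stall conditions stated in the text are modelled.

```python
from enum import Enum


class Kind(Enum):
    """Hypothetical classification of a decoded instruction."""
    LOAD_STORE = "load_store"   # floating-point load or store
    ARITHMETIC = "arithmetic"   # floating-point arithmetic operation
    OTHER = "other"             # not a floating-point instruction


def interface_signals(kind, fpu_busy):
    """Return (fpu_new_instr, cpu_stall) for one decoded instruction.

    fpu_busy means the FPU is executing an arithmetic operation.
    """
    is_load_store = kind is Kind.LOAD_STORE
    is_arith = kind is Kind.ARITHMETIC
    # Loads and stores are always issued; arithmetic only when the FPU is free.
    fpu_new_instr = is_load_store or (is_arith and not fpu_busy)
    # The CPU stalls only on an arithmetic instruction while the FPU is busy.
    cpu_stall = is_arith and fpu_busy
    return fpu_new_instr, cpu_stall
```

Note that a load or store is issued even while the FPU is busy, which is what allows memory operations to overlap with multi-cycle arithmetic.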


1.2.2.  Extended  Instruction  Set  and  Data  Formats

The extended floating-point instruction set [2,6] is shown in Table 1-1. Maintaining the RISC
philosophy, the instruction set contains only register-register arithmetic instructions and load-store memory
operations. All of the arithmetic operations except multiply and divide require four cycles. The multiply
and divide instructions are implemented using iterative algorithms (see Section 3.1.3) and thus take 9 and
22 cycles, respectively. Loading data from the cache into the FPU requires only one cycle, while storing
data into the cache from the FPU takes two cycles.


Table 1-1 SPUR Extended Floating-Point Instruction Set


Instruction Type   Instruction Syntax      Instruction Semantics      Cycles

ARITHMETIC         FADD  Rd,Rs1,Rs2        Rd <- Rs1 + Rs2            4
ARITHMETIC         FSUB  Rd,Rs1,Rs2        Rd <- Rs1 - Rs2            4
ARITHMETIC         FMUL  Rd,Rs1,Rs2        Rd <- Rs1 * Rs2            9
ARITHMETIC         FDIV  Rd,Rs1,Rs2        Rd <- Rs1 / Rs2            22
ARITHMETIC         FABS  Rd,Rs1,0          Rd <- Rs1; sign <- 0       4
ARITHMETIC         FNEG  Rd,Rs1,0          Rd <- Rs1; sign <- -sign   4
                   FCMP  cond,Rs1,Rs2      FPSW(cond) <- result
                   FMOV  Rd,Rs1,0          Rd <- Rs1
                   CVTS  Rd,Rs1,0          Rd(sgl) <- Rs1(ext)
                   CVTD  Rd,Rs1,0          Rd(dbl) <- Rs1(ext)
LOAD               LD SGL   Rd,Rs1,RC      Rd <- M(Rs1+RC)            1
LOAD               LD DBL   Rd,Rs1,RC      Rd <- M(Rs1+RC)            1
LOAD               LD EXT1  Rd,Rs1,RC      Rd <- M(Rs1+RC)            1
LOAD               LD EXT2  Rd,Rs1,RC      Rd <- M(Rs1+RC)            1
STORE              ST SGL   Rs2,Rs1,SC     Rs2 -> M(Rs1+SC)           2
STORE              ST DBL   Rs2,Rs1,SC     Rs2 -> M(Rs1+SC)           2
STORE              ST EXT1  Rs2,Rs1,SC     Rs2 -> M(Rs1+SC)           2
STORE              ST EXT2  Rs2,Rs1,SC     Rs2 -> M(Rs1+SC)           2

As shown in the table, there are four separate types of load and store instructions. This is because
single, double, and extended data formats may be specified (see Figure 1-3), and in the case of the extended
format a separate load/store instruction is required for each 64-bit word. Although data may exist in
memory in any of the above formats, only the extended format is actually implemented in the hardware.
Therefore, when data in the single or double format is loaded into the FPU, it is automatically converted to
the extended format by unpacking the exponent and fraction and assigning them to the extended fields, and
setting tag bits to specify one of the six IEEE data types. To store data from the FPU into memory in the
single or double format requires that the proper convert instruction be used in order to convert the data
from the extended format into the desired format before the store is executed.
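
As an illustration of the unpacking step for the single format, the sketch below splits a 32-bit word into its fields and determines which of the six IEEE data types it represents. It is an approximation under stated assumptions: the function names are hypothetical, the actual tag-bit encoding is not given in this report (type names are returned instead), the placement of the fields into the extended format is omitted, and the quiet/signalling distinction assumes the common convention that a set most-significant fraction bit marks a quiet NaN.

```python
def unpack_single(bits):
    """Split a 32-bit single-format word into (sign, exponent, fraction)."""
    sign = (bits >> 31) & 0x1
    exponent = (bits >> 23) & 0xFF   # EXP<7:0>
    fraction = bits & 0x7FFFFF       # FRACTION<22:0>
    return sign, exponent, fraction


def classify_single(bits):
    """Return the name of the IEEE data type encoded by a single-format word."""
    _, exponent, fraction = unpack_single(bits)
    if exponent == 0:
        return "zero" if fraction == 0 else "denormalized"
    if exponent == 0xFF:
        if fraction == 0:
            return "infinity"
        # Assumption: fraction MSB set means quiet NaN, clear means signalling.
        return "quiet NaN" if fraction & 0x400000 else "signalling NaN"
    return "normalized"
```

For example, `classify_single(0x3F800000)` (the encoding of 1.0) reports a normalized value, the case that the hardware handles entirely on its own.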


SINGLE:    EXP<7:0>, FRACTION<22:0>
DOUBLE:    EXP<10:0>, FRACTION<51:0>
EXTENDED:  EXP<16:0>, FRACTION<63:0>

Figure 1-3 FPU Data Formats


1.2.3.  FPU  Floorplan

The basic structure of the floating-point chip is depicted in Figure 1-4. The FPU consists primarily
of four main modules: the exponent (EXP), fraction, and multiply-divide (MULDIV) blocks, which
constitute the datapath; and the control module, which is the focus of this report. The control unit latches and
decodes incoming floating-point instructions, using a combination of PLAs and random logic to generate
the necessary signals to control the datapath.

The main section of the control unit consists of the control programmable logic arrays (PLAs), a
cycle counter, and interface logic, whereas the logic which generates the individual datapath control signals
is located in the random logic strips in the proximity of the datapath block which it controls.
The inputs to this logic are generally routed directly from the main control block, though in some instances
these inputs come from other portions of the datapath. The outputs of this logic are then individually
buffered to drive the datapath.

Figure 1-4 FPU Floorplan


The floating-point chip is implemented in 2µm CMOS technology with two metal layers, where the
power supply and data lines run horizontally in metal-2 and the control lines run vertically in metal-1. The
entire FPU design is implemented assuming a four-phase non-overlapping clock, where the target cycle
time is approximately 100ns - 140ns with a non-overlap time of 5ns - 10ns between each phase.

1.3.  Report  Outline

As stated above, this report focuses on the implementation of the control unit for the SPUR
floating-point coprocessor. When I joined the project this year, much of the control was already specified at the
functional level and had been simulated using SLANG [18,19], but no control had been laid out. This
report documents the control layout completed for this project, beginning with a description at the
functional level and progressing down to the low-level circuit details, including alternative implementations,
functional verification, and simulation results.

Chapter 2 describes the main control block and the interface control, starting with an overview of the
entire control unit. The second section describes the three PLAs central to the control unit design. This
section is followed by a discussion of the implementation of the cycle counter. The last section of this
chapter is centered on the design of the load-store pipeline.

Chapter 3 revolves around the datapath, including a description of the main datapath modules and the
random logic used to control the modules. The first sections of this chapter describe the operation of the
three datapath modules: the exponent, fraction, and multiply-divide units. The next sections discuss the
layout strategies considered for the random control logic, including layout structure and regularity, speed
optimization, and area minimization. This discussion is concluded with a comparison of the two alternative
methods actually implemented. The following section deals with clock skew considerations for the control
lines, including capacitance extractions of the datapath and individualized buffer sizing techniques.

The first section of Chapter 4 presents an overview of the CAD environment, outlining the various
simulation tools employed and the interface between the tools. The following sections present the overall
simulation results obtained using these tools including, when applicable, a comparison between the various
tools.

Chapter  5  summarizes  the  work  completed,  providing  conclusions  and  suggestions  for  future 
research.  This  includes  interesting  points  discovered,  lessons  learned,  and  suggested  paths  to  follow/avoid 
in  future  work.  The  following  two  chapters  contain  acknowledgements  and  references. 

A lengthy appendix is attached which is broken into six main parts. Appendix A defines the inputs
and outputs of the three control PLAs. Appendix B includes cell schematics and documentation which are
referenced in this report. Appendix C includes the layout plots of most of the unique cells incorporated in
the control unit, ranging from single cell plots up to a plot of the entire floating-point control unit.
Appendix D documents the control signal definitions and their implementation. Tables of the extracted
capacitances for each of the datapath modules, including buffer sizes and total delay times, are given in Appendix
E. Finally, Appendix F contains source files for each of the simulators.
2.  INTERFACE  CONTROL  UNIT
2.1.  Control  Unit  Overview 

A simplified block diagram of the control unit is shown in Figure 2-1. New floating-point
instructions are latched on the chip in clock phase 3. In the following phase, the instruction PLA decodes the
opcode from the instruction latch, and the register destination is latched into the first stage of the load-store
pipeline (see Figure 2-5), along with load-store specifiers in the case of a load or store instruction. The
4-stage pipeline allows the FPU to execute load-store operations concurrently with the multi-cycle arithmetic
operations, as the required information can be latched into the pipeline every cycle. If an arithmetic
operation is received, and the FPU is not currently busy executing a previous arithmetic operation, the operation
type is latched into the arithmetic ops register, where it serves as an input to the arithmetic PLA. Using a
3-bit state register and a PLA, the internal finite state machine records the current state of the FPU (see
Figure 2-2) and maintains the fpuBusy signal along with various other signals. The cycle counter keeps track
of the current instruction cycle dependent on the state of the FPU, and is decoded by the arithmetic PLA.
Outputs of the arithmetic PLA, cycle counter, finite state machine, and load-store pipeline serve to generate
the random control signals required by the datapath modules.

2.2.  PLA  Partitioning 

Figure 2-1 Control Unit Block Diagram (The shaded blocks are PLAs).


As shown in Figure 2-1, three PLAs form the core of the control unit: the instruction PLA, the
arithmetic PLA, and the PLA in the internal finite state machine (IFSM). Each of the PLAs has a set of pass
gates associated with it to ensure that the inputs to the PLA cannot change unless the pass gate control is
asserted. That is, the pass gates allow the PLAs to operate only in the phase associated with the pass gate
control; in all other phases the outputs of the PLAs will be considered stable and valid. The phase of
operation for each of the control PLAs is indicated in Table 2-1. As an example, since the arithmetic PLA
is evaluated in phase 2, its outputs can be used only in phases 3 and 4 of the current cycle, and phase 1 of
the next cycle; signals needed in phase 2 must be created independent of the PLA. Furthermore, all of the
outputs of the PLAs are generated independent of clock phase. That is, if a control signal is defined as a
function of various inputs and a clock phase, the logical AND of the PLA output and the clock phase must be
implemented outside the PLA.
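
The phase rule for PLA outputs can be expressed as a short calculation. This is only a sketch of the rule as stated in this section: the function name `usable_phases` and the `(cycle_offset, phase)` representation are my own illustrative choices, not part of the design.

```python
PHASES_PER_CYCLE = 4  # four-phase non-overlapping clock


def usable_phases(eval_phase):
    """Return the (cycle_offset, phase) pairs in which outputs of a PLA
    evaluated in eval_phase (1..4) are stable and may be used.

    cycle_offset is 0 for the current cycle and 1 for the next cycle.
    """
    result = []
    # Outputs are stable in the three phases that follow the evaluation phase.
    for step in range(1, PHASES_PER_CYCLE):
        index = (eval_phase - 1) + step
        cycle_offset = index // PHASES_PER_CYCLE
        phase = index % PHASES_PER_CYCLE + 1
        result.append((cycle_offset, phase))
    return result
```

For the arithmetic PLA, `usable_phases(2)` yields phases 3 and 4 of the current cycle followed by phase 1 of the next cycle, matching the example in the text.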

As mentioned earlier, the function of the instruction PLA is to decode the instruction opcode, which
is its only input (see Appendix A for a definition of all PLA inputs and outputs). The arithmetic PLA
generates control signals for arithmetic operations, basically dependent on the type of instruction being
executed and the current cycle count of that instruction. The IFSM is used to keep track of the current state of
the FPU. The state transition diagram defining the IFSM is shown in Figure 2-2. As seen in the diagram,
the next state of the FPU is dependent on its current state and the control signals ctrl-TrapRecvd,
ctrl-start-arithop, ctrl-fpuSuscond, and ctrl-STOP. As there are eight unique states possible, a state vector of three
bits is maintained external to the IFSM to hold the current FPU state.

Figure 2-2 State Transition Diagram


The PLAs were each simulated using CRYSTAL [13,16] to determine the worst-case delay times.
The results obtained are shown in Table 2-1, along with the corresponding PLA area. The last two
columns list the worst-case propagation delays for the three PLAs, where the first column of delays
corresponds to the worst-case times for the PLA outputs to fall from high to low, measured relative to the
change in the input. Likewise, the last column corresponds to the worst-case times for the PLA outputs to
rise from low to high. The high-to-low delays are consistently the slowest among the PLAs, where the
worst case is 11.35ns. This is suitable for use in a 20ns (minimum) phase time while still allowing the PLA
outputs to be latched within the same phase of evaluation (see Table 2-4) with an appropriate safety
margin.


Table 2-1 Comparison of Control PLA Delay Versus Size


PLA                   Phase   Inputs   Outputs   Area (µm²)   tpHL (ns)   tpLH (ns)

instruction           PHI4    7        25        142746       11.13       9.93
arithmetic            PHI2    18       19        180675       10.34       9.74
finite state machine  PHI1    7        7         70282        11.35       9.43

As seen here, there is little difference (only about 0.5ns) between the delay times although, for
example, the instruction PLA is about twice the size of the finite state machine. However, if all the PLAs had
been condensed into one big PLA the total area would have been approximately five times greater, and we
can extrapolate that the extra delay incurred would be at least 2.5ns, not counting the delays involved in the
external routing required to and from the PLA. As it is, the partitioning of the PLAs effectively minimizes
the average delay time while maintaining a logical grouping by function.


2.3.  Cycle  Counter

In order to keep track of the progression of the current instruction, a cycle counter has been
incorporated as part of the control unit. For this purpose, a 5-bit counter was required, since the maximum
number of cycles occurring in any FPU instruction is 22 (see Table 1-1). As shown in Figure 2-3, the cycle
value is clocked into the master latch in phase 1, and into the slave latch in phase 2. In phase 3, the current
cycle value is passed through to the increment logic, thus ensuring that the counter can be incremented only
once per cycle. If the FPU is currently busy with an instruction or starting a new instruction, and is not
suspended, the cycle value is incremented by one. Otherwise, the old cycle value is again clocked into the
master latch. The counter may be cleared asynchronously in phase 3.
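
The counter width and update rule can be checked with a few lines of code. This is a behavioural sketch only: the function names are hypothetical, the latch-level phase timing is ignored, and the wrap-around at 32 is my assumption (the text does not say what happens beyond the longest instruction).

```python
MAX_CYCLES = 22  # FDIV, the longest instruction in Table 1-1


def counter_bits(max_count):
    """Number of bits needed to represent counts from 0 up to max_count."""
    return max_count.bit_length()


def next_cycle_value(value, busy, starting, suspended, clear=False):
    """Behavioural model of one counter update per clock cycle."""
    if clear:
        return 0
    if (busy or starting) and not suspended:
        # Assumption: the count wraps modulo 2**bits if it ever overflows.
        return (value + 1) % (1 << counter_bits(MAX_CYCLES))
    return value  # the old value is clocked back into the master latch
```

Here `counter_bits(22)` evaluates to 5, since 22 does not fit in four bits (maximum 15) but does fit in five (maximum 31).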

Usually, the current cycle value is fed into the arithmetic PLA and is decoded by the end of phase 2.
As discussed in the previous section, this means that the decoded cycle value can only be used in phases 3
and 4, or phase 1 of the following cycle. The five cycUclock-inii lines are used as inputs to random logic
which locally decodes the cycle value for control signals which are needed in phase 2.

The control signal ctrl-fpuBusy is latched and stable by the end of phase 2, ctrl-start-arithop by the
end of phase 4, and ctrl-fpuSuscond becomes available in phase 1 (see Figure A-B1 in Appendix B). Thus,
the critical path here is to ensure that ctrl-fpuSuscond stabilizes early enough in phase 1 to allow the
incrementor to work and the count value to be latched by the end of that phase. As the input to the
ctrl-fpuSuscond latch is stable by the end of phase 4 and the worst-case delay for the static latch is 3.0ns (see
Table 2-4), the worst-case delay for the increment logic (2nand2nand driving 0.404pF) is 2.0ns, and the
set-up time for the master latch is 3.0ns, we still have about 12ns left to actually perform the 5-bit
increment. (Note that this assumes a minimum specified phase time of 20ns.)
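
The 12ns figure follows directly from subtracting the three fixed delays from the minimum phase time. The sketch below simply restates that arithmetic; the function and parameter names are hypothetical, and the default values are the ones quoted in the paragraph above.

```python
def increment_budget_ns(phase_ns=20.0, latch_delay_ns=3.0,
                        inc_logic_ns=2.0, setup_ns=3.0):
    """Time left in phase 1 for the 5-bit increment itself.

    phase_ns       minimum specified phase time
    latch_delay_ns worst-case delay of the static ctrl-fpuSuscond latch
    inc_logic_ns   worst-case delay of the increment logic
    setup_ns       set-up time of the master latch
    """
    return phase_ns - latch_delay_ns - inc_logic_ns - setup_ns


budget = increment_budget_ns()  # 20 - 3 - 2 - 3 = 12 ns
```

Any incrementor whose worst-case carry propagation plus sum delay stays below this budget meets the phase 1 timing requirement.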
Two alternate methods of performing the increment were studied: the dynamic one shown below,
which I used, and the static carry-look-ahead type used in the fraction datapath. The latter method was
easiest to implement since the desired format already existed in the datapath. However, this method was
too slow to meet the above specifications, due to the overhead involved in the carry-look-ahead
preconditioning; this overhead is simply too expensive when only a 5-bit increment is involved. The
carry-look-ahead logic also incurred a large area overhead. In fact, the incrementor alone using carry-look-ahead took
up almost as much space as the whole counter when implemented using the dynamic method!

The dynamic 5-bit incrementor shown in Figure 2-4 is based on the sum and carry definitions derived
from the truth table in Table 2-2. Using simple Boolean algebra, we see that the sum for a bit is the
exclusive-or of the current input and the carry into that bit, whereas the carry-out of that bit is the logical
AND of the current input and the carry-in. The carry-in to the first cell is simply the output of the increment
logic. That is, if it is desired to increment, the carry-in is high, and thus one is added to the current value.
Otherwise, the value remains unchanged.
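
The sum and carry definitions can be exercised with a bit-level model. This is a functional sketch of the ripple increment only, with a hypothetical function name; it says nothing about the dynamic pass-gate circuit that actually implements the carry chain.

```python
def increment5(value, inc):
    """Ripple 5-bit increment built from the per-bit definitions:
    sum = input XOR carry-in, carry-out = input AND carry-in.

    inc is the output of the increment logic (carry-in to bit 0).
    Returns (new_value, carry_out_of_bit_4).
    """
    carry = 1 if inc else 0
    result = 0
    for bit in range(5):          # least significant bit first
        a = (value >> bit) & 1
        s = a ^ carry             # sum for this bit
        carry = a & carry         # carry into the next bit
        result |= s << bit
    return result, carry
```

With `inc` low the carry chain stays at zero and the value passes through unchanged, which corresponds to the hold case described above.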
Table 2-2 Sum and Carry Definition


Figure 2-4 5-Bit Increment Implementation


A prime consideration here was the delay incurred in the dynamic carry propagation from the least
significant bit to the most significant bit through the chain of pass gates. Various p-channel sizes were
simulated using CRYSTAL to find an appropriate trade-off between layout area and propagation delay. As
seen in Table 2-3, an equivalent pass width of 2λ (8/4) results in a total delay (including the exclusive-or
delay) of 75.74ns, whereas an equivalent pass width of 12λ (48/4) brings this delay below 5ns. Further
increases in p-channel width result in comparatively small gains in speed. Each of the p-channel sizes was
thus chosen to be 48λ, and the corresponding n-channel sizes were made 24λ with no additional area
requirements. Since the mobility of p-type carriers is about 50% less than that of n-type carriers, and the
current in the transistor channel is directly proportional to the carrier mobility, keeping the
2:1 ratio in width between the p-channel and the n-channel serves to equalize the available currents (and
thus the delays) through both types of transistors. This ensures that in the worst case, where the inc signal
goes low late in phase 1, any carry which may have been propagated up to that point can be discharged
through the n-channel pass gates at approximately the same rate as the p-channel gates propagated the
carry. The discharge transistors ensure that no carry is propagated except during phase 1, when the actual
increment and latching occurs.

Table 2-3 P-Channel Width Versus Carry Propagation Delay


Channel (λ)   Cell Size (λ)   Delay (ns)

8             54              75.74
48            68              4.99
75            75              4.06

The implementation of the exclusive-or circuit, derived by a truth table similar to that above, is
shown in Figure A-B2 in Appendix B. The 1.7ns delay associated with the exclusive-or function could
possibly be reduced, if necessary, simply by increasing the transistor widths, which are currently minimum
size.


2.4.  Load-Store  Pipeline 

The loading of data from memory into the on-chip register file and the storing of data into memory
from the register file is handled by the load-store pipeline and its associated memory control logic. As
shown in Figure 2-5, the decode stage of the pipeline receives the load-store and register destination
specifier information in phase 4 if a new FPU instruction is signalled. The load-store information is
obtained from the instruction PLA, where it is decoded from the instruction opcode, while the register
destination specifier is received directly from the instruction. If the FPU is not suspended and no traps have
been received, this information is passed from one pipeline stage to the next in phase 1, allowing the
pipeline to accept new memory information every cycle. The pipeline is suspended along with the FPU by
recirculating the current contents of the two intermediate stages through a mux rather than passing along
the contents of the previous stage. In this case, the contents of the write stage are cleared. If a trap occurs,
the two intermediate stages are cleared, effectively flushing the pipeline.
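
The advance, suspend, and flush behaviour of the four stages can be modelled at the cycle level. The sketch below is a behavioural approximation, not the mux-and-latch implementation: the class and attribute names are hypothetical, each call to `step` lumps the phase 1 transfer and the following phase 4 decode load into one cycle, and two points the text leaves open are filled by assumption (the decode stage holds its contents during a suspend, and a trap clears the decode and write stages as well as the two intermediate ones).

```python
class LoadStorePipeline:
    """Four stages: decode, two intermediate stages, and write."""

    def __init__(self):
        self.decode = None
        self.mid1 = None
        self.mid2 = None
        self.write = None

    def step(self, new_info=None, suspended=False, trap=False):
        """Model one clock cycle and return the contents of the write stage."""
        if trap:
            # Stated: both intermediate stages are cleared on a trap.
            # Assumption: decode and write are cleared too.
            self.decode = self.mid1 = self.mid2 = self.write = None
        elif suspended:
            # Stated: intermediate stages recirculate, write stage is cleared.
            # Assumption: the decode stage also holds its contents.
            self.write = None
        else:
            # Normal operation: every stage passes its contents along.
            self.write = self.mid2
            self.mid2 = self.mid1
            self.mid1 = self.decode
            self.decode = new_info  # new load-store information, if any
        return self.write
```

In this model an entry loaded into the decode stage reaches the write stage three unsuspended cycles later, and a new entry can follow it on every cycle.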


Figure 2-5 Load-Store Pipeline Block Diagram (inputs from the instruction PLA and the instruction latch)


The implementation of the pipeline was very systematic as it contained only two basic cells: a 2:1
mux and a clearable latch. Since a dynamic clearable latch offered little time or area savings yet had
questionable noise margins (see Table 2-4), full static cells were used (see Figures A-B3 and A-B4 in Appendix
B). For both the load-store pipeline and the cycle counter, I chose to maintain the llX cell pitch for
consistency with the datapath modules, thus allowing me to access an existing library for many of the basic
cell forms needed. These cells required modification, mainly to reduce the amount of routing associated
with interconnecting the cells, some custom sizing of transistors, and simulation to verify functionality and
timing constraints. The static mux was simulated using SPICE [20] and found to have a delay time of 0.5ns
(assuming an output capacitance of 0.4pF). SPICE parameters obtained for both the static and dynamic
latches (assuming an output capacitance of 0.5pF) are listed in Table 2-4.

Table 2-4 Comparison of Latch Circuit Parameters


Parameter 

Static 

Dynamic 

3.0  ns 

2.0  ns 

4.0  ns 

3.0  ns 

lEEl 

0.5  ns 

■SB 

The memory control logic associated with the load-store pipeline uses the same random logic design
discussed in the next chapter. However, for the sake of modularity, this logic is included as part of the
load-store pipeline, as it is used by both the exponent and fraction front-ends. Detailed definitions of the
control logic and implementation can be found in Appendix D.


3.  DATAPATH  CONTROL  UNIT 


3.1.  Datapath  Modules 

The following three sections present a brief overview of the operation of the main datapath modules:
the exponent unit, the fraction unit, and the multiply-divide unit. As seen in the simple block diagrams
given with each section, the majority of the datapath consists of only a few basic types of simple cells:
multiplexors, latches, shifters, tri-state bus drivers, and adders. A few miscellaneous cells are also needed,
including the register cells and the front-end unpacking and convert logic. For the most part, this regularity
greatly simplifies layout generation and simulation.

Control logic is associated with most of the basic cells, such as select lines for the multiplexors,
clock lines for the latches, enables for the tri-state buffers, and so on. The second part of this chapter is
concerned with the generation of this control logic, including implementation methods and problems such
as speed optimization, area minimization, and clock skew. Appendix E contains detailed diagrams for each
of the datapath modules, indicating the relative placement of the control signals used for each of the basic
cells. The labels used in the diagrams are the actual names given the layout cells and their associated
control.

3.1.1.  Exponent  Unit 

A simplified block diagram of the 17-bit exponent datapath [1] is shown in Figure 3-1. The operation
of the datapath is very straightforward. Data is loaded into the register file (from the bottom of the
figure), being converted to the extended format and setting data type tags if necessary. The instruction
register specifiers are used to access the current source exponents. For an add or subtract operation it is
necessary to determine which exponent is larger, so that the binary points of the numbers can be aligned.
It is important to determine this quickly, as the fraction unit cannot begin the addition or subtraction until
the alignment is complete. Therefore, the difference between the two exponents is determined
simultaneously using the two subtracters shown, Ea-Eb and Eb-Ea. The reason for executing both subtractions
simultaneously is that although only one subtraction is needed to determine which exponent is larger, the
positive difference between the larger and the smaller is needed for alignment, and it may be that the one
subtraction instead gave a negative number. Thus, performing both subtractions at once ensures that the
proper difference will be available immediately. The multiplexor, MuxEb>Ea, selects which of the
differences obtained from the two subtracters is to be used as the shift amount for the fraction unit. Two other
lines, Eb>Ea and Ed>128, are also used by the fraction unit. The first signal indicates that the exponent on
the B-bus is larger than that on the A-bus. This is necessary as the fraction datapath assumes that the
operand with the greater exponent comes from the A-bus, and if this is not the case a swap must be
performed by the fraction unit. The Ed>128 signal indicates that the difference between the exponents was
greater than 128 (7 bits). Once the addition or subtraction is complete, the Eg+/-El adder/subtractor is used
to adjust the result exponent by the correct normalization distance. As the adder/subtractor assumes that
the left operand is the greater exponent and the right operand is the normalization distance of the result, the
first multiplexor, MuxEg, is used to select the greater exponent and the second multiplexor, MuxNDist, is
used to select the normalization distance. The normalized result is then latched into EresLatch, and is
ultimately put onto the B-bus.
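
The alignment step described above can be illustrated with a small behavioral sketch. It is not taken from the SPUR design files: the function name `exponent_align` is hypothetical, and the exponents are assumed to be plain unsigned integers that fit in the 17-bit datapath.

```python
def exponent_align(ea, eb, width=17):
    """Behavioral model of the two parallel subtracters and the MuxEb>Ea select.

    ea, eb: exponents on the A-bus and B-bus (assumed unsigned, < 2**width).
    Returns (shift_amount, eb_gt_ea, ed_gt_128).
    """
    mask = (1 << width) - 1
    diff_ab = (ea - eb) & mask   # Ea-Eb subtracter (wraps around if Eb > Ea)
    diff_ba = (eb - ea) & mask   # Eb-Ea subtracter (wraps around if Ea > Eb)
    eb_gt_ea = eb > ea           # Eb>Ea: tells the fraction unit to swap operands
    # MuxEb>Ea: pick whichever subtracter produced the positive difference.
    shift_amount = diff_ba if eb_gt_ea else diff_ab
    ed_gt_128 = shift_amount > 128   # Ed>128 signal
    return shift_amount, eb_gt_ea, ed_gt_128


print(exponent_align(10, 3))    # A-bus exponent is larger
print(exponent_align(3, 10))    # B-bus exponent is larger, swap needed
```

Both differences are computed before it is known which one is valid, which mirrors the reason given in the text for using two subtracters.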

The same datapath is used to perform the necessary tasks for a multiply or divide operation. The
Eg+/-El adder/subtractor is used to calculate the sum of the exponents for multiplication, and the difference
of the exponents for division. This adder is then used again to normalize this result. Unlike the operation
above for an add/subtract instruction, the left operand to the adder is now the smaller exponent. Thus the
second multiplexor is used to select the smaller exponent rather than the normalization distance.

3.1.2.  Fraction  Unit 

The main function of the 64-bit fraction unit [1] as shown in Figure 3-2 is to perform the addition or
subtraction of the floating-point fraction (magnitude). Symmetrical with the operation of the exponent unit,
the fraction portion of the data is loaded into the register file, again going through any necessary unpacking
and conversion to the extended format. The register specifiers access the current source magnitudes from
the register file, and prepare to perform the specified addition or subtraction. To do this, however, requires
that the binary point of the two magnitudes be aligned. As mentioned earlier, the fraction unit receives the
required shift amount for this alignment from the exponent unit, along with a signal indicating which
exponent is the largest. If this signal indicates that the operand with the greater exponent is not on the
A-bus, a multiplexor is used to swap the two operands. The barrel shifter is then used to shift the greater
magnitude right until alignment of the two operands is achieved, as determined by the difference of their
exponents. The information lost during the right shift can be condensed into the three GRS bits (guard,
round, and sticky). The guard and round bits are the two most-significant bits shifted out, and the sticky bit
is the logical or of all of the rest of the lost bits. This bit indicates whether any precision was lost in the
shift operation, or if the bits shifted out were all zeros.
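
As a minimal sketch of how the GRS bits summarize the shifted-out information, the function below models the right shift on an integer fraction. The name `shift_right_grs` is hypothetical, and the model ignores the barrel shifter's actual structure.

```python
def shift_right_grs(frac, shift):
    """Shift frac right by `shift` bits and condense the lost bits into GRS.

    Returns (shifted_frac, guard, round_bit, sticky).
    """
    if shift == 0:
        return frac, 0, 0, 0
    lost = frac & ((1 << shift) - 1)           # all bits that fall off the end
    guard = (lost >> (shift - 1)) & 1          # most-significant lost bit
    round_bit = (lost >> (shift - 2)) & 1 if shift >= 2 else 0
    rest = lost & ((1 << (shift - 2)) - 1) if shift > 2 else 0
    sticky = 1 if rest else 0                  # logical or of the remaining lost bits
    return frac >> shift, guard, round_bit, sticky


print(shift_right_grs(0b10110111, 4))
```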

Figure 3-2 Fraction Datapath

Once the operands are aligned, the addition or subtraction is performed and clocked into an
intermediate latch. To complete the operation, rounding and normalization are done using the three GRS bits, a
rounding PLA, and an incrementor. If necessary, a normalization distance is sent to the exponent unit for
normalizing the final result exponent. The final result is ultimately put onto the B-bus.

3.1.3. Multiply-Divide Unit

A simplified diagram of the multiply-divide unit [3] is depicted in Figure 3-3. Due to area
constraints on the chip, the 64x64-bit multiply is implemented as an iterative 64x8-bit multiply, with the partial
sum and carry vectors being accumulated in the PPS-SLatch and PPC-SLatch, respectively. A large
carry-look-ahead adder is required to add these two vectors, forming the final product. Again due to area
constraints, the adder already existing in the fraction unit was borrowed for this purpose. This adder is also
used for calculating the complement of the multiplicand/divisor.

As shown in the diagram, the multiplier (MPR) is latched from the A-bus, and the complement of the
multiplicand (COMPMCD), along with the multiplicand (MCD) itself, is latched from the B-bus. A
version of the Booth recode algorithm is used here, which takes in turn each byte of the multiplier (including
the most-significant bit from the previous byte) and groups the byte into four overlapped triplets. Each of
the triplets is decoded by the Booth logic to select one of five inputs to the carry-save-adder (CSA) tree:
zero, MCD, 2MCD, COMPMCD, and 2COMPMCD. The input selected by the least significant triplet is
latched directly into its associated latch, the input selected by the next significant triplet is shifted left two
bits, and so on. The four (shifted) inputs selected from the four triplets are added to the accumulated
partial sum and carry vectors from the previous iterations using the 4-stage CSA tree to obtain the new partial
sum and carry vectors. To avoid using a 128-bit datapath, the partial sum and carry vectors are shifted
right by eight with each iteration. A 'rounding' adder accumulates the bits which are shifted out. As stated
above, the final sum and carry vectors are added in the fraction unit to obtain the product.
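
The byte-at-a-time recoding can be sketched in software as follows. This is a behavioral illustration only, assuming the standard radix-4 Booth digit rule; the function names are hypothetical. It uses ordinary integer addition in place of the CSA tree, and it does not model the right-shift-by-eight or the 'rounding' adder. The final correction term for an unsigned multiplier is likewise an assumption of this sketch, not a description of the SPUR hardware.

```python
def booth_digits(byte, prev_msb):
    """Recode one multiplier byte into four radix-4 Booth digits in {-2,...,2}.

    The byte is extended with the most-significant bit of the previous byte,
    then split into four overlapped 3-bit triplets.
    """
    bits = (byte << 1) | prev_msb              # 9 bits; bit 0 is the overlap bit
    digits = []
    for i in range(4):
        t = (bits >> (2 * i)) & 0b111          # one overlapped triplet
        low, mid, high = t & 1, (t >> 1) & 1, (t >> 2) & 1
        digits.append(low + mid - 2 * high)    # selects 0, +/-MCD, or +/-2MCD
    return digits


def booth_multiply(mpr, mcd, nbytes=8):
    """Multiply two unsigned integers, one multiplier byte per iteration."""
    acc = 0
    prev_msb = 0
    for k in range(nbytes):
        byte = (mpr >> (8 * k)) & 0xFF
        for i, d in enumerate(booth_digits(byte, prev_msb)):
            acc += (d * mcd) << (8 * k + 2 * i)   # each triplet is worth 4**i
        prev_msb = byte >> 7
    # Assumed correction: Booth recoding reads the top bit as a sign bit,
    # so an unsigned multiplier needs one extra +MCD term at the top.
    acc += (prev_msb * mcd) << (8 * nbytes)
    return acc


print(booth_digits(0x02, 0))
print(booth_multiply(12345, 6789) == 12345 * 6789)
```

A digit of -1 or -2 corresponds to selecting COMPMCD or 2COMPMCD in the text, which is why the complement of the multiplicand is latched alongside the multiplicand itself.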

The divide operation is performed using an iterative non-restoring algorithm [3], where two bits of
the quotient are determined per iteration. The same datapath is used for both multiply and divide, and
much of the hardware is shared. The multiplexor preceding the PPS-SLatch allows the PPS slave latch to
be loaded either with the contents of the master latch or the contents of the A-bus. For a divide operation,
the A-bus is used to initially load the dividend into the latch. As opposed to the multiply operation which
shifts right by eight, the PPS and PPC latches, which hold the partial remainders, are shifted left two bits
with each iteration. Under the algorithm, six bits of partial remainder and four bits of divisor are used by
the quotient logic to select the next two bits of the quotient.

3.2. Control Implementation Considerations

The random logic design associated with the datapath and memory control is definitely the most
challenging part of the control unit implementation. This is mainly because there is little inherent structure
in (and between) the individual cells, and each cell needs to be fine-tuned for its particular application.
That is, each signal must be buffered depending on the capacitance of the line it is driving to ensure that
there will be minimal propagation delays, and the delays must be balanced such that clock skew is avoided.
The following sections contain a detailed discussion of these considerations, including a study of
alternative implementations and support of design decisions.

3.2.1.  Layout  Strategy 

The features I strove for in the design of the random logic included layout structure and regularity,
speed optimization, and area minimization. Structure and regularity are especially instrumental in
achieving layout that is easy to generate, simulate, debug and modify. In particular, it is much easier to simulate
and change one cell that is used several times, rather than individually testing and modifying several
unique cells. These features are also important in terms of cell interconnection and routing. Several
error-prone steps can be avoided simply by arranging the basic cells such that most of the interconnections
between cells, particularly the power supply lines, are made automatically. This reduces the chances of
missing interconnections, and the regularity generally allows such flagrant errors to be spotted on layout
plots.

As always, there are trade-offs involved in any design problem. The questions of area minimization
and speed optimization are often opposed to one another, and both may be opposed to the concept of
structure and regularity. For example, minimizing the area and associated delays for each signal requires the
individual design and optimization of each random logic cell, contrary to the features desired above.
Likewise, all cells could be designed minimum-sized, but would not operate at the optimum speed. It has been
shown [10] that in order to minimize the delay for a single stage, the ratio of the stage's output capacitance
to input capacitance should be approximately e. As an example of the use of this rule, a chain of inverters
(buffers) is often used to drive a large load capacitance. If an unlimited number of buffer stages may be
used, the first inverter is designed with minimum-size transistor channels, the second inverter's channels
are about three times larger, and so on. However, if the number of stages is limited, then the optimum
solution is to scale each of the succeeding stages by the fanout factor f, where Y = f^N. Here, Y is defined
as the ratio of the total load capacitance to the input gate capacitance of the first stage, and N is the total
number of stages. Another consideration here is the concept of delay balancing and clock skew, which is
discussed in detail in Section 3.2.4.
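
The sizing rule above can be worked through numerically. The sketch below assumes only the relation Y = f^N from the text and the classical result that a per-stage ratio near e minimizes delay when the stage count is free; the function names and the example capacitances are hypothetical.

```python
import math


def stage_ratio(load_cap, input_cap, n_stages):
    """Per-stage scale factor f for a fixed number of stages, from Y = f**N."""
    y = load_cap / input_cap
    return y ** (1.0 / n_stages)


def unconstrained_stage_count(load_cap, input_cap):
    """Stage count that puts the per-stage ratio close to e (N = ln Y)."""
    y = load_cap / input_cap
    return max(1, round(math.log(y)))


# Example: a 3.0 pF load driven from an assumed 0.02 pF input gate (Y = 150).
print(stage_ratio(3.0, 0.02, 2))            # two stages allowed
print(unconstrained_stage_count(3.0, 0.02)) # stages if the count were free
```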

3.2.2.  Alternative  Implementations 

Looking for some degree of structure in the random cell layout, I studied the various forms of logic
needed for the datapath and memory control signals (see Appendix D). After some manipulation, most of
the logic was transformed into a two-level and-or form (or nand-nand as shown in Section 3.2.3), giving
me a template upon which to base my cell design. This led to the consideration of several methods of
implementation. The and-or format naturally suggested a dynamic or domino cell [5], whereas the
nand-nand format suggested a static layout using a library of nand cells. Pure domino logic was rejected on the
basis that many of the logic blocks are used in successive phases for some operations, leaving no time for
pre-charge. However, a pseudo-nMOS implementation seemed a likely extension. The following sections
present a pseudo-nMOS and a full static implementation studied in detail, concluding with a comparison of
the two methods and the selection of the full static design for the random logic implementation.

3.2.2.1. Pseudo-nMOS

As shown in Figure 3-4, the pseudo-nMOS form adapts readily to the and-or format, and can easily be
extended to any number of inputs corresponding to this format. From the layout plots included in
Appendix C, we can see that this design style provides both economy of space and a well-structured design. The
area savings is mainly provided by the fact that only two p-channel pull-ups are required, whereas in
typical static circuits a full complementary design is employed - providing one p-channel pull-up for every
n-channel pull-down. However, there are many important drawbacks to this type of ratioed logic, including
increased power consumption and charge-sharing considerations.

Figure 3-4 Pseudo-nMOS Design Style

To begin, several hand-calculations were performed to establish a ratio factor for the cells. Given
that it was desirable to keep the static pull-up transistor as small as possible to decrease the amount of
power consumed, I wanted to find the equivalent width of the pull-down transistors required to guarantee
that they were strong enough to pull the pre-charged node (Vx) down below a specified safety voltage to
avoid any chance of accidentally triggering the succeeding stage. At the time, we set the safety voltage at
0.5V, given that the threshold voltage of the following stage is about 0.75V and including a slight safety
margin, and derived the ratio factor using this value. Obviously, the bigger the p-channel device, the more
current it will source, the more power it will consume, and the harder it will be for the n-channel devices to
overcome it to pull the node down towards ground. No matter how many and strings I have hanging off of
the Vx node, the worst case is where only one of them is turned on (assuming all and branches have
equivalent n-channel widths), thus giving less current to ground than the case where two or more branches
are turned on. This assumption will give a conservative estimate for the actual size needed, as some
leakage will occur among the off branches thus helping to lower the node voltage. For this case, I can easily
solve for the steady-state voltage at the pre-charged node using the fact that the source (pull-up) current is
equal to the sink (pull-down) current. The calculation proceeds as follows.

PMOS: $V_{gs} = -5$ V; $V_T = -0.75$ V; $V_{ds} = V_x - 5$
$V_{gs} - V_T = -4.25$ V. (Assume $V_x < 0.75$ V.)

Therefore, $V_{ds} < -4.25$ V. (The transistor is saturated.)

$I_{ds} = \frac{KP_p}{2} \frac{W_p}{L} (V_{gs} - V_T)^2$, where $KP_p$ is the p-channel transconductance parameter.

Assume that for a string of n-channel gates, the equivalent W/L ratio for the string is the individual
ratio divided by the number of input gates in the string.

NMOS: $V_{gs} = 5$ V; $V_T = 0.75$ V; $V_{ds} = V_x$
$V_{gs} - V_T = 4.25$ V. (Assume $V_x < 0.75$ V.)

Therefore, $V_{ds} < V_{gs} - V_T$. (The transistor is linear.)

$I_{ds} = \frac{KP_n}{2} \frac{W_n}{L} \left[ 2 (V_{gs} - V_T) V_{ds} - V_{ds}^2 \right]$, where $KP_n$ is the n-channel transconductance parameter.

Setting these currents equal, and setting $V_x = 0.5$ V:

$\frac{KP_p}{2} \frac{W_p}{L} (-4.25)^2 = \frac{KP_n}{2} \frac{W_n}{L} \left[ 2 (4.25)(0.5) - (0.5)^2 \right]$

$\frac{W_p}{W_n} \approx 0.623$

Therefore, if $W_p$ is minimum size (4λ), then $W_n$ must be approximately 6.42λ. Solving the quadratic
equation for $V_x$ given $W_p = 4$ and $W_n = 6$ gives us $V_x = 0.54$ V. This was still considered acceptable,
giving a safety margin of 0.21 V.
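
The hand calculation above can be re-traced numerically. The sketch below balances the saturated pull-up current against the linear pull-down current, using the 5 V supply and 0.75 V thresholds from the text. The constant `K_RATIO` is an assumption: it stands for the effective n-to-p transconductance ratio (including any division for series n-channel devices) and is back-computed here from the stated 6.42λ : 4λ result rather than taken from process data. The function names are hypothetical.

```python
import math

VDD = 5.0        # supply voltage (V)
VT = 0.75        # threshold magnitude for both device types (V)
K_RATIO = 2.813  # ASSUMED effective KP_n/KP_p, back-computed from 6.42:4


def wn_over_wp(vx, k_ratio=K_RATIO):
    """Width ratio W_n/W_p that holds the node at vx with one branch on."""
    vov = VDD - VT                       # 4.25 V overdrive on both devices
    return vov ** 2 / (k_ratio * (2 * vov * vx - vx ** 2))


def node_voltage(wp, wn, k_ratio=K_RATIO):
    """Steady-state node voltage for given widths (smaller quadratic root)."""
    vov = VDD - VT
    # wp*vov^2 = wn*k*(2*vov*vx - vx^2)  ->  vx^2 - 2*vov*vx + c = 0
    c = wp * vov ** 2 / (wn * k_ratio)
    return vov - math.sqrt(vov ** 2 - c)


print(4 * wn_over_wp(0.5))    # n-channel width (in lambda) for a 0.5 V limit
print(node_voltage(4, 6))     # node voltage with W_p = 4, W_n = 6
print(4 * wn_over_wp(0.25))   # n-channel width for a 0.25 V limit
```

Under this assumed ratio the sketch gives values close to the 6.42λ width and 0.54 V node voltage quoted above.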

Many cells were laid out and simulated using this ratio factor. Table 3-1 lists the results for some
of the more common configurations used for comparison with the full static method, which is discussed in
the next section.


3.2.2.2. Full Static CMOS

A  typical  full-complementary  static  design  is  presented  in  Figure  3-5.  It  is  noticeably  more  complex 
than  the  pseudo-nMOS  design,  due  to  the  multiple  pull-ups,  the  two-level  design,  and  the  extra  routing. 
However,  this  static  design  has  none  of  the  ratio,  noise  margin,  charge  sharing,  and  power  consumption 
problems  associated  with  the  previous  design. 

A library of cells was constructed containing the basic building blocks needed to generate any of the
specified control signals. Following the previous discussion, I tried to minimize the delay between the logic
levels, and therefore designed cells suited for a specific level. That is, all first-level cells were built with
minimum-size channel widths and all second-level cells have channel widths about three times this size.
There were only a few instances where a third level of logic was required - these were sized on an
individual basis. Since there are so many different control signals driving a wide range of capacitances, it would
be a difficult problem to individually optimize each level of each cell. Thus the factor of three was chosen
as a universal approximation; the discrepancies in delay are taken up in the double buffer stage as
discussed in Section 3.2.4. To maintain the goals of structure and regularity, the levels were designed such
that inputs flow in from the top in metal-1, and the logic outputs flow out the bottom in metal-1, with
interconnections between levels being performed automatically (see Appendix C). As most of the cells
used metal-2 only for Vdd and GND routing, many channels are left free and can be used for global
routing.

3.2.2.3. Comparison

A comparison of various parameters of interest for the two design styles discussed above is presented
in Table 3-1 below. In order to make the comparisons as accurate as possible, all simulations were
performed assuming a typical datapath load capacitance of 3.0pF (from tables in Appendix E) and double
buffering of 60λ (n-channel width). The first column is the function name of the cell being simulated,
where the cells are ordered in groups of the two alternatives. In all cases, the top cell (and-or structure) is
the pseudo-nMOS implementation, and the bottom cell (nand-nand structure) is the full-complementary
static implementation. As shown in the following section, these two implementations can be shown to be
functionally equivalent through repeated applications of DeMorgan's theorems. See Appendix C, Figures
A-C1 to A-C10, for a comparison of the actual layout plots implementing the following functions.

Table 3-1 Comparison of Pseudo-nMOS Versus Full Static CMOS

Function      Devices   H(λ)   W(λ)   VL (V)   VH (V)   tpLH (ns)   tpHL (ns)
2and1or          5       63     28    0.420    4.908      4.5         6.0
2nand1nand       4      147     24    0.000    5.000      5.0         5.0
2and2or          7       63     43    0.420    4.926      5.0         6.0
2nand2nand      12      147     48    0.000    5.000      5.5         5.5
2and3or          9       63     46    0.418    4.834      5.0         6.5
2nand3nand      18      147     72    0.000    5.000      6.0         6.0
2and4or         11       63     60    0.417    4.697      5.0         13
2nand4nand      24      147     96    0.000    5.000      6.5         6.0
2and5or         13       63     74    0.417    4.660      5.0         13
2nand5nand      30      147    120    0.000    5.000      7.0         6.0
3and2or          9       63     48    0.473    4.704      5.0         7.0
3nand2nand      16      147     64    0.000    5.000      6.5         5.5

The second column indicates the number of transistors required to implement the given function.
For the simplest function shown, a 2-input and gate (2and1or or 2nand1nand), fewer transistors are required
for the static implementation than for pseudo-nMOS. This can be explained by the inverter overhead required
for the pseudo-nMOS implementation. However, for most of the configurations, the number of transistors
needed for the static implementation is around twice that of the pseudo-nMOS requirement. This is
because the static method requires two transistors for every additional input, whereas only one is required
using pseudo-nMOS. Furthermore, the static implementation has the additional overhead of the second
level of logic, which also has two transistors for every input, while the pseudo-nMOS has only the
overhead of the inverter in all cases. The most commonly used cell type is the 2-input and gate, which is used
about twice as many times as the next three configurations in the table and about five times as often as the
last two configurations.
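
The device counts in Table 3-1 follow directly from the counting argument in this paragraph. The sketch below is an illustration of that argument, not a tool used in the design; the function names are hypothetical, and the single-term case is assumed to need no second-level gate in the static style.

```python
def pseudo_nmos_devices(and_inputs, or_terms):
    """One n-channel pull-down per input, one p-channel load, one output inverter."""
    return and_inputs * or_terms + 1 + 2


def static_devices(and_inputs, or_terms):
    """Two devices per input in each nand gate, on both logic levels."""
    first_level = 2 * and_inputs * or_terms
    # Assumption: with a single and term there is no second-level nand.
    second_level = 2 * or_terms if or_terms > 1 else 0
    return first_level + second_level


for m, k in [(2, 1), (2, 2), (2, 3), (2, 4), (2, 5), (3, 2)]:
    print(f"{m}and{k}or: pseudo-nMOS {pseudo_nmos_devices(m, k)}, "
          f"static {static_devices(m, k)}")
```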

The height and width of the actual layout cells required by the functions are given in columns three
and four. As shown, the pseudo-nMOS uses much less chip area, attributable to three facts. First,
fewer devices take less area, especially since basically only one type of channel is used (n-channel) and
thus the overhead of the tub separation design rule is reduced. Secondly, the implementation allowed the
layout to be resolved using a single level, reducing interconnect overhead. Finally, the and strings using
the pseudo-nMOS method could be placed immediately adjacent to their neighbor string, thus sharing
many ground and node connections. In contrast, the use of nand cells from a general library in the static
implementation required that the individual cells be separated by a minimum distance of 4λ, and no
connections could be shared. The main consideration here was to try to keep the logic cells tall and narrow, so
that they could all be placed directly above the datapath cells where required. As shown, the
pseudo-nMOS method best met this goal.

The next two columns show the noise margin characteristics of the cells. As expected from a full
CMOS implementation, the static style obtained full restoration of the voltage levels. The pseudo-nMOS
method also performed as expected, with the low voltage being slightly less than the 0.5V threshold limit.
The high voltage indicates some charge sharing, though the amount also remains within the 0.5V specified
safety threshold.

Propagation delay times are given in the final two columns. All times are comparable, with static
low-to-high (rise) times usually slower and high-to-low (fall) times usually faster than the pseudo-nMOS
times. The pseudo-nMOS fall times will always increase with the addition of and strings, as there is
essentially a limited supply of current available through the minimum-size p-channel source and an increasing
amount of capacitance to charge. The rise times are dependent on the width of the n-channel sinks, which
could be increased with a small area penalty. However, they cannot be decreased without upsetting the
ratio balance, so it is difficult to reduce skews at this level by balancing the rise and fall times. The static
cells are designed with a 2:1 p-channel to n-channel ratio in order to provide automatic balancing of the
rise and fall times. As seen, a better natural balance could possibly be obtained by increasing this ratio
towards 2.5:1 (this is dependent on the mobilities of the p- and n-type materials, which is a processing
parameter).

The decision to proceed with the "safe" static design came after much discussion [2], although the
area required for this implementation was shown to be much larger. The basic argument in support of this
decision was the fact that the extra area needed was available on the chip, and so the "safer" design seemed
appropriate. Besides significantly decreasing design time, the use of library cells also aided the goal of
structured and regular layout, providing an easy way to add new cells or quickly modify existing ones
simply by replacing one library cell with another. No major penalties were taken with the rise and fall times,
either, as both were comparable, and the buffer sizes can still be optimized to reduce the times shown here.

Looking back, I believe that the pseudo-nMOS implementation would have been quite suitable for
this application, though at the expense of an increased design cycle. In particular, the safety voltage
threshold should be lowered to allow an even greater noise margin. Specifying a safety threshold of 0.25V
would have involved increasing the n-channel sizes such that a ratio of 12.44:4 is maintained between the
n-channel width and the minimum-size p-channel width. However, this would also have served to increase
the charge-sharing problem, as increasing the n-channel width increases the amount of capacitance in the
and strings available for charge sharing. This could then be alleviated by increasing the size of the
pre-charge node, possibly by increasing the size of the output inverter associated with this implementation and
thus increasing its input gate capacitance. However, this also has its penalty, as increasing the size of the
pre-charge node increases the circuit delay. Many more SPICE simulations (and costly design time!)
would have been necessary to ensure safe operation using the pseudo-nMOS method, and safety is
especially important in control implementation.

3.2.3.  Logic  Definition  and  Conversion 

Most of the random control signals defined in the SLANG description of the floating-point unit were
converted to a nand-nand format for implementation using DeMorgan's theorems. These theorems can be
stated simply as (AB)' = A'+B' and (A+B)' = A'B' [9], where + denotes the or function and '
indicates complementation. Two typical examples of the sequence of steps taken to define and implement the
random logic are shown below. The first example, cm-latch-master, illustrates one of the most common
logic forms implemented: the 2nand2nand structure.

(defnode cm-latch-master
  (depends ctrl-mul/div-latch phi2 phi4)
  (update (And (Or phi2 phi4) ctrl-mul/div-latch))
)

Example  3-1  SLANG  Description  of  cm-latch-master 

From Example 3-1, we see that cm-latch-master = (phi2+phi4)·ctrl-mul/div-latch. This can be
directly expanded to (phi2·ctrl-mul/div-latch) + (phi4·ctrl-mul/div-latch), bringing us to the and-or
representation. To use the theorems given above, we can let A = (phi2·ctrl-mul/div-latch)' and B =
(phi4·ctrl-mul/div-latch)'. This gives us an and-or form of the type A'+B'. Using the first theorem, we
have A'+B' = (AB)', or cm-latch-master = ((phi2·ctrl-mul/div-latch)'(phi4·ctrl-mul/div-latch)')', which
can be recognized as the nand-nand representation. Both of these forms are depicted in Figure 3-6.

Figure 3-6 Equivalent Representations of cm-latch-master
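
The equivalence of the two forms of cm-latch-master can be confirmed by exhaustive evaluation. The following sketch is only a software check of the Boolean identity; the Python function names are hypothetical stand-ins for the SLANG node and its nand-nand realization.

```python
from itertools import product


def nand(*inputs):
    """n-input nand gate on Boolean values."""
    return not all(inputs)


def master_slang(phi2, phi4, ctrl):
    """Direct reading of the SLANG update: (And (Or phi2 phi4) ctrl)."""
    return (phi2 or phi4) and ctrl


def master_nand_nand(phi2, phi4, ctrl):
    """2nand2nand form: two first-level nands feeding one second-level nand."""
    return nand(nand(phi2, ctrl), nand(phi4, ctrl))


equivalent = all(
    master_slang(*v) == master_nand_nand(*v)
    for v in product([False, True], repeat=3)
)
print(equivalent)
```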


As another example, I considered the SLANG definition of cm-latch-slave. Unlike most of the
control signals, cm-latch-slave is unique in that I implemented it using three-level logic, as illustrated in Figure
3-7. It was possible to reduce cm-latch-slave to a two-level implementation using DeMorgan's theorems,
but this required 4 inverters, one 2-input nand gate, eight 4-input nand gates, and one 9-input nand gate!
Generally, nand gates are limited to at most five inputs as using more inputs leads to either a large series
resistance to ground or a huge input gate capacitance, both of which increase the delay times. Therefore,
the three-level implementation was considered more appropriate, and also required only two different types
of nand gates.

(defnode cm-latch-slave
  (depends ctrl-mul/div-latch phi1 phi3 clock6_mulop clock15 ctrl-latch-ops)
  (update (Or (And (Or phi1 phi3)
                   (Not (And phi3 clock6_mulop))
                   (Not (And phi1 clock15))
                   ctrl-mul/div-latch)
              ctrl-latch-ops
)))

Example  3-2  SLANG  Description  of  cm-latch-slave 

To begin, I converted phi1+phi3 to the nand type, again using the first theorem. By letting A* = phi1
and B* = phi3, we get that A*+B* = (AB)*, or phi1+phi3 = (phi1*·phi3*)*. The last two levels of logic
are treated as in the example above.
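
A three-level nand arrangement for cm-latch-slave can be verified in the same way. The sketch below is an assumption about the gate structure (Figure 3-7 itself is not reproduced here): three 2-input nands at the first level, one 4-input nand at the second, and a final 2-input nand that merges in the inverted ctrl-latch-ops. The Python names are hypothetical stand-ins for the SLANG signals.

```python
from itertools import product

def nand(*inputs):
    """NAND gate: output is 0 only when every input is 1."""
    return 0 if all(inputs) else 1

def inv(x):
    return 1 - x

def slang_slave(phi1, phi3, clock6_mulop, clock15, mul_div_latch, latch_ops):
    # Direct transcription of the SLANG update expression
    term = ((phi1 | phi3)
            & inv(phi3 & clock6_mulop)
            & inv(phi1 & clock15)
            & mul_div_latch)
    return term | latch_ops

def three_level_slave(phi1, phi3, clock6_mulop, clock15, mul_div_latch, latch_ops):
    # Level 1: three 2-input NANDs
    or_phases = nand(inv(phi1), inv(phi3))      # phi1 + phi3 by DeMorgan
    not_mul = nand(phi3, clock6_mulop)
    not_clk = nand(phi1, clock15)
    # Level 2: one 4-input NAND
    level2 = nand(or_phases, not_mul, not_clk, mul_div_latch)
    # Level 3: 2-input NAND with the complement of ctrl-latch-ops
    return nand(level2, inv(latch_ops))

mismatches = [v for v in product((0, 1), repeat=6)
              if slang_slave(*v) != three_level_slave(*v)]
print("mismatches:", mismatches)
```

Under this assumed structure only 2-input and 4-input nand gates appear, which is consistent with the remark that the three-level form needs only two gate types.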

Appendix D contains a complete listing of all of the control signal definitions as derived from the
SLANG description, along with their equivalent nand-nand representation. Except in special instances,
all signals were implemented using nand-nand logic.

Figure 3-7 Equivalent Representations of cm-latch-slave


3.2.4. Clock Skew Considerations

The minimization of clock skew is one of the most important considerations in the design of the random
control logic. Clock skew [11] is the delay in a clock signal from its point of generation to the point
at which it is used, and is due to the resistive and capacitive delays incurred in the travel between
the two points. The effects of this delay between clock signals can be clearly seen in the case of two
latches, as shown in Figure 3-8. (The example shown assumes a two-phase non-overlapping clock, but the
concept can easily be extended to the case of a four-phase non-overlapping clock.) Given the ideal
timing diagram for phi1 and phi2, we see that the two latches will never be operating simultaneously,
thus providing the isolation needed for correct operation. That is, the combinational logic determines
a value which is clocked into the second latch on the falling edge of phi2, and on the falling edge of
phi1 the new value is clocked into the first latch, where it serves as an input to the combinational
logic. A problem arises only when the feedback loop to the combinational logic is closed, resulting in
a form of race condition. In this case, it becomes possible for the combinational logic to receive as
input the value which it has just determined, causing the value to be updated at least twice within one
cycle. This condition could occur only if one of the signals were delayed (in other words, if phi1 and
phi2 were skewed) to such an extent that the non-overlap time between the phases disappeared.

Figure 3-8 Typical Clock Skew Example with Non-Overlapping Clocks

Since delays are certain to occur in the clock lines, the best way to handle clock skew is to balance
the delays such that the relative clocking scheme is preserved. That is, given the worst-case delay of
all of the control signals, it is desirable to keep the delay of every signal as close to this
worst-case delay as possible, and the non-overlap time must be longer than the maximum variation in
delay between the signals. For high performance, it is necessary to keep this non-overlap time as short
as possible in order to minimize the clock cycle.
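
The balancing condition above can be expressed as a simple check. This is a minimal sketch with hypothetical signal names and delay values; it only illustrates the rule that the non-overlap time must exceed the spread between the slowest and fastest control line.

```python
def max_skew(delays_ns):
    """Largest difference in arrival delay between any two control lines."""
    return max(delays_ns.values()) - min(delays_ns.values())

def non_overlap_is_safe(delays_ns, non_overlap_ns):
    """True when the non-overlap time exceeds the worst delay variation."""
    return non_overlap_ns > max_skew(delays_ns)

# Hypothetical control-line delays in ns (not taken from the layout)
delays = {"ctrl_a": 7.5, "ctrl_b": 6.0, "ctrl_c": 5.5}

print(max_skew(delays))
print(non_overlap_is_safe(delays, 10.0))
```

Balancing every line toward the worst-case delay shrinks `max_skew`, which in turn allows a shorter non-overlap time and therefore a shorter clock cycle.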

Most of the datapath control logic uses a type of gated clock [11], which is formed by the logical and
of a control signal and one of the clock phases. The same idea applies here, though in actual
implementation it is now important to consider the total delay from the generation of the clock signal,
through the random logic, to each latch in the datapath. Given the huge datapaths in the FPU
(> 70 bits), this delay can be considerable. That is, although the random logic may be balanced such
that each logic cell has the same delay, clock skew can still occur if the delays to the last latch on
the datapath which the cells are driving are significantly different. In order to take this into
account, each random logic cell and its associated buffers are scaled to drive their particular load
capacitances at approximately the same rate [3].

Implementing the delay balancing required individual attention for each control signal along the
datapath, as the capacitances varied over a huge range, and often different levels of logic were used
in generating the control signals, thus providing various levels of driving ability. Appendix E
contains tables of the extracted control line capacitances for the three main datapath modules, along
with their chosen buffer sizes and the total delay. As shown, the largest capacitive load occurs in the
multiply-divide carry-save-adder tree and is approximately 11.3 pF. The goal here then is to minimize
the delay for this control signal (cm-latch-master) and to scale all other delays to match it. As the
datapath control line capacitances range anywhere from about 0.5 pF to 11.3 pF, this is a difficult
task.


Table 3-2 SPICE Delays for Common Circuit Configurations

Function     Load (pF)  Buffer (λ)  tp (ns)  Per Stage      tp (ns)  Per Stage
?            1.0        40          3.5      1.5+1.0+1.0    4.0      2.0+1.0+1.0
             2.0        40          4.5      2.0+1.5+1.0    5.0      2.0+1.5+1.5
             3.0        60          5.0      2.5+1.0+1.5    5.0      2.0+2.0+1.0
             4.0        60          5.5      2.5+1.5+1.5    6.0      2.5+2.0+1.5
             6.0        80          7.0      2.0+2.0+3.0    ?        ?
             6.0        100         6.0      2.5+2.0+1.5    ?        ?
2nand2nand   1.0        40          4.5      2.5+1.0+1.0    4.5      2.5+1.0+1.0
             2.0        40          5.5      2.5+1.5+1.5    5.5      2.5+1.5+1.5
             3.0        60          5.5      2.5+2.0+1.0    5.0      2.5+1.0+1.5
             4.0        60          6.0      2.5+1.5+2.0    6.0      2.5+2.0+1.5
             6.0        100         6.0      3.0+2.0+1.0    5.5      2.5+1.5+1.5
             8.0        100         7.0      3.0+2.0+2.0    7.0      3.0+2.0+2.0
             11         100         9.0      3.0+2.5+3.5    9.0      3.0+3.0+3.0
             11         160         7.5      3.5+2.0+2.0    7.0      3.0+2.5+1.5
             11         200         7.5      4.0+2.0+1.5    8.0      4.0+2.0+2.0
2nand3nand   1.0        40          5.0      3.0+1.0+1.0    4.5      2.5+1.0+1.0
             2.0        40          6.0      2.5+2.0+1.5    6.0      2.5+1.5+2.0
             3.0        60          6.5      3.0+2.0+1.5    5.5      2.5+1.0+2.0
             4.0        60          7.0      3.0+2.0+2.0    6.5      ?
             6.0        100         7.0      ?              ?        ?
?            1.0        40          5.0      3.0+1.0+1.0    5.0      2.5+1.5+1.0
             2.0        40          6.5      3.0+2.0+1.5    6.0      2.5+2.0+1.5
             ?          40          8.0      3.0+2.5+2.5    7.0      2.5+2.0+2.5
             ?          60          7.0      3.5+2.0+1.5    6.0      2.5+2.0+1.5
             ?          60          ?        3.5+2.5+2.0    ?        2.5+2.5+2.0
             6.0        60          9.0      3.5+2.5+3.0    ?        2.5+3.0+2.5
             6.0        100         8.0      3.5+2.0+2.5    ?        3.0+2.5+1.5
2nand5nand   1.0        40          6.0      3.5+1.0+1.5    5.0      2.5+1.5+1.0
             2.0        40          ?        3.5+2.0+1.5    6.0      2.5+2.5+1.0
             ?          ?           8.0      3.5+2.0+2.5    7.0      2.5+2.5+2.0
             ?          ?           7.0      3.5+2.5+1.0    6.0      3.0+2.0+1.0
             ?          60          8.0      4.0+2.0+2.0    7.0      3.0+2.5+1.5
             ?          80          7.5      4.0+1.5+2.0    6.5      3.0+2.0+1.5
             6.0        80          9.0      4.0+2.5+2.5    7.5      3.0+2.5+2.0
             6.0        100         8.0      4.0+2.5+1.5    7.0      3.0+2.0+2.0
3nand2nand   1.0        40          4.5      2.5+1.0+1.0    4.5      2.5+1.0+1.0
             2.0        40          5.5      2.5+1.5+1.5    5.5      2.5+2.0+1.5
             3.0        60          6.0      2.5+2.0+1.5    6.0      2.5+2.0+1.5
             4.0        60          6.5      3.0+2.0+1.5    6.5      2.5+2.0+2.0
             6.0        100         7.0      3.0+2.0+2.0    6.5      3.0+2.0+1.5

(? marks an entry that is illegible in the source; the subscripts distinguishing the two tp columns
are likewise illegible.)

A series of SPICE simulations was run in order to compile Table 3-2. As shown there, a few of the more
commonly used circuit types were simulated iteratively with various load capacitances and buffer sizes
in order to find the configuration which most closely matched the cm-latch-master (2nand2nand) delay of
7.5 ns. (It is still possible to optimize this delay by increasing the strength of the first levels in
order to equalize the delays between each of the stages.) Given these numbers, the cells were placed
and buffers were sized accordingly, and CRYSTAL was then used for fine-tuning the delay of each cell.
It is important to note that the numbers in Table 3-2 are only a crude approximation, as the iterations
were performed at the SPICE level and did not take into account all of the changes in area and junction
capacitances for the different buffer sizes.
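
The buffer selection step can be sketched as a lookup against simulated delays. The rows below are the 2nand2nand entries for an 11 pF load as read from Table 3-2; the selection rule (minimize the worse of the two deviations from the target, preferring the smaller buffer on a tie) is an assumption made for illustration and is not necessarily the procedure that was followed by hand.

```python
# (buffer size, first tp in ns, second tp in ns) for 2nand2nand driving 11 pF,
# as read from Table 3-2
ROWS_11PF = [(100, 9.0, 9.0), (160, 7.5, 7.0), (200, 7.5, 8.0)]

def pick_buffer(rows, target_ns):
    """Return the buffer size whose two delays stay closest to target_ns."""
    def worst_error(row):
        size, tp_a, tp_b = row
        # Rank by worst-case deviation, then by size so ties favor less area
        return (max(abs(tp_a - target_ns), abs(tp_b - target_ns)), size)
    return min(rows, key=worst_error)[0]

best = pick_buffer(ROWS_11PF, 7.5)
print(best)
```

The same function could be applied per load capacitance and per cell type to get a first-cut buffer size before fine-tuning with a timing analyzer.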

Since none of the global routing is complete at this point, the method above assumes that all clock
inputs to the random logic arrive at the same time and that the other inputs are stable through the
clock transitions. This minimizes additional skews that could be incurred by random delays in control
inputs.

4. SIMULATION AND VERIFICATION


4.1. CAD Environment

As stated in the introduction to this report, much of the control had already been specified at the
functional level and had been simulated using SLANG when I joined this project. My role thus involved
turning the SLANG description into working layout. A progression of CAD tools was used in this process,
as illustrated in Figure 4-1.


Figure  4-1  CAD  Tools  Used  in  the  Control  Development  Cycle 


The PLAs, block diagrams, and random logic definitions presented in this report were all derived from
the SLANG description of the FPU. The graphical layout editor, MAGIC [14,16], was then used to
implement the SLANG definitions at the layout level. Circuit descriptions of the various blocks were
extracted from the MAGIC environment, and conversion programs were executed to obtain the necessary
file formats for the various simulators. Three types of simulators were used in verifying the control
layout: MOSSIM [4], CRYSTAL [13,16], and SPICE [20]. The first two are high-level interactive
simulators, while SPICE is a low-level batch simulator. Appendix F contains a section on each of these
simulators, detailing important hints on starting up the simulators and including example source files
used for initializing the simulation parameters and input nodes.

MOSSIM is an interactive functional simulator which employs a very primitive circuit model. The
simulator is based on the relative strengths and sizes of the circuit elements, where strength is a
function of the width of a transistor relative to all the other transistors in the circuit, and size is
a function of the relative capacitance of a node. For the most accurate representation, it is desirable
to allow as many strength and size definitions as the circuit requires. Unfortunately, MOSSIM is
limited to a total of 15 strengths and sizes, so a ratio factor is used to group transistors within a
certain range of widths under the same strength, and nodes within a certain range of capacitances under
the same size. This fact is especially important in dynamic or ratioed circuits. For instance, MOSSIM
may evaluate some circuits incorrectly simply because the model has treated all of the transistor sizes
as the same strength when in fact the transistors are ratioed safely, or has grouped critical nodes
under the same size when in fact the charge-shared node may be considerably larger than those with
which it is supposedly sharing charge. Therefore, it is important when creating the MOSSIM input file
to keep the ratio factor as small as possible for the most accurate circuit representation.
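
The effect of the ratio factor can be illustrated with a simple geometric binning of transistor widths. This is only a sketch of the idea of grouping by ratio; it is not MOSSIM's actual classification algorithm, and the widths and function name are hypothetical.

```python
def strength_class(width, smallest, ratio):
    """Index of the bin [smallest*ratio**k, smallest*ratio**(k+1)) holding width."""
    cls = 0
    bound = smallest * ratio
    while width >= bound:
        cls += 1
        bound *= ratio
    return cls

def classify(widths, ratio):
    smallest = min(widths)
    return [strength_class(w, smallest, ratio) for w in widths]

widths = [3, 4, 6, 12, 40]          # hypothetical transistor widths

fine = classify(widths, ratio=2)    # small ratio factor: more distinct strengths
coarse = classify(widths, ratio=4)  # large ratio factor: widths merge together
print(fine, coarse)
```

With the larger ratio factor, the width-3 and width-6 devices fall into the same class, which is the kind of merging that can make a safely ratioed circuit look ambiguous to a strength-based simulator.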

The MOSSIM timing model I used assumes that each phase specified is as long as required for all the
nodes to stabilize. This means that if the circuit functions successfully under MOSSIM, it should
function successfully in practice, given no timing constraints (an infinite clock cycle). Once the
circuit description is read into the simulator, inputs and control signals can be changed
interactively, and any specified nodes may be watched. Also, intermediate nodes may be forced to a
particular value in order to isolate subsections of the circuit for further testing. There are
basically only three possible logic levels: high (1), low (0), and intermediate (X). The intermediate
level can mean a variety of things, such as undefined, charge sharing, short, and so on. By asserting
various control signals and varying the inputs to the circuit, we can immediately observe the levels on
the output nodes in order to test the functionality of the circuit.
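
The three logic levels can be illustrated with a generic three-valued nand gate. This is a textbook ternary-logic sketch, not MOSSIM's switch-level evaluation; the constant `X` and the function name are illustrative.

```python
X = "X"  # intermediate level: undefined, charge sharing, short, and so on

def nand3v(a, b):
    """Two-input NAND over the levels 0, 1, and X."""
    if a == 0 or b == 0:
        return 1        # a single low input forces the output high
    if a == 1 and b == 1:
        return 0
    return X            # otherwise the output cannot be determined

print(nand3v(0, X), nand3v(1, X), nand3v(1, 1))
```

A low input masks an unknown on the other input, which is why forcing an intermediate node to a known value can isolate a subsection of the circuit for testing.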

CRYSTAL is an interactive simulator used for timing analysis. This simulator also follows a very
primitive circuit model, using signal-flow analysis to find the worst-case delay paths for the given
circuit. As in MOSSIM, nodes can be forced to a desired logic level in order to direct the signal flow
along paths of interest, or to rule out impossible paths which CRYSTAL may consider. Also, specified
nodes may be watched to find the worst-case delay to that node, independent of the delay of the overall
circuit.
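
The notion of a worst-case delay to a watched node can be sketched as a longest-path computation over a delay graph. This is a generic illustration and not CRYSTAL's algorithm or input format; the node names are hypothetical, and the stage delays on the phi2 path are the 2nand2nand per-stage values for an 11 pF load with a 160 buffer from Table 3-2 (the `ctrl` edge delay is invented).

```python
def worst_case_delay(edges, target):
    """Longest accumulated delay from any source node to target (acyclic graph)."""
    preds = {}
    for src, outs in edges.items():
        for dst, delay in outs:
            preds.setdefault(dst, []).append((src, delay))
    memo = {}

    def arrival(node):
        if node not in memo:
            # A node with no predecessors is a source and arrives at time 0
            memo[node] = max((arrival(p) + d for p, d in preds.get(node, [])),
                             default=0.0)
        return memo[node]

    return arrival(target)

edges = {
    "phi2": [("nand_a", 3.5)],
    "ctrl": [("nand_a", 3.0)],     # hypothetical, faster input
    "nand_a": [("nand_b", 2.0)],
    "nand_b": [("latch", 2.0)],
    "latch": [],
}
print(worst_case_delay(edges, "latch"))
```

Watching an intermediate node corresponds to calling the function with that node as the target, independent of the delay of the overall circuit.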

SPICE was used for simulating the very low-level circuit considerations. These include noise margin
voltages and charge sharing, low-level functional testing, and latch set-up and hold times. SPICE also
provides accurate delay times for comparison with CRYSTAL's somewhat cruder model. This is particularly
important as only CRYSTAL is used for timing delays of the bigger layout modules.

Most of the basic cells used in the control layout were simulated first using SPICE, giving functional
verification and timing information at the lowest level of layout. Each of the functional blocks was
simulated in its entirety using MOSSIM, which in particular tests the interconnections between the
basic cells and any related effects. Critical paths were simulated using CRYSTAL for an overall timing
evaluation.

If a problem was found using any of the simulators, it was usually uncovered and fixed at the layout
level. The majority of these errors were related to labelling, particularly not labelling some power
supply lines, or accidentally attaching the label of a circuit node to one of the power supply lines,
such that the simulator believes the node to be shorted to that line although the layout itself is
correct. This happens often while editing in MAGIC, since when layout is stretched or moved the
associated labels are often transferred to another layer in an intermediate stage and are never
restored to the original layer. If the layout yielded no information about the problem, SPICE was used
in order to study the voltages (as a function of time) at individual nodes in greater detail.

All in all, the biggest problem I found concerning the CAD support was the inconsistency between the
various tools. Each of the simulators requires a different type of input file for the circuit
description, all of which are derived from the .sim file. (CRYSTAL is the friendliest here, as it
actually uses the .sim file as input.) For large modules, this requires lengthy conversion times and
the accumulation of several types of huge data files, each describing the same circuit. This is
especially inefficient when a simulation result indicates a problem which must be fixed in the layout:
the circuit must then be re-extracted, re-converted, and re-read into the simulator in order to
simulate the modified layout. One example I encountered was in the CRYSTAL simulation of the datapath
control delays in order to tweak the sizing of the double buffers. Each time I adjusted a buffer size I
had to repeat the above procedure in order to monitor the effects of that change.

Given my experience with the above problems, I definitely recognize the need for "smarter" CAD tools.
For example, one design group here has completely automated the process of random logic generation
through the use of their Design Manager [17] and a standard cell library. A program has been written
which, given a set of equations, generates a file suitable for input to the Design Manager, which can
then be used to automatically select the appropriate standard cells from the library and access
placement and routing programs to complete the layout. Another such tool is EPOXY [12], currently under
design here at Berkeley. EPOXY takes the concept of silicon compilers one step further, aiming not only
to synthesize layout from a given circuit description but also to improve the performance of the
generated layout in order to meet desired specifications. Using this tool, it will be possible to model
the circuit given user-specified circuit constraints and parameters, such as desired delays, power
consumption considerations, layout area, and so on; specify the layout style and technology to be used;
simulate the circuit as specified and observe the effects; interactively or automatically adjust the
parameters and constraints until the desired behavior is observed; and then automatically generate the
layout.

4.2. Simulation Results

Each of the basic control cells has been simulated at the lowest level using SPICE; detailed results of
the individual simulations are included within this report. The three control PLAs have been simulated
using CRYSTAL; worst-case delay times are presented in Table 2-1. MOSSIM has been used to functionally
verify the major blocks of the control unit: the load-store pipeline, the cycle counter, and the random
logic blocks associated with each datapath module. All of the simulation results indicate full
functionality. SPICE and CRYSTAL results verify that the control unit can meet the minimum
specifications of a 20ns clock phase, though to ensure against clock skew the non-overlap time should
remain at 10ns.
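
As a rough consistency check, these phase and non-overlap figures can be combined into a cycle time. The sketch assumes, and this is an assumption rather than a statement from the report, that one cycle consists of four phases each followed by one non-overlap interval.

```python
def cycle_time_ns(phases, phase_ns, non_overlap_ns):
    """Cycle time when every phase is followed by one non-overlap interval."""
    return phases * (phase_ns + non_overlap_ns)

cycle = cycle_time_ns(phases=4, phase_ns=20, non_overlap_ns=10)
print(cycle)
```

Under that assumption the result falls inside the target cycle time of approximately 100ns to 140ns stated in the abstract.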

5. CONCLUSIONS

The control unit implementation for the SPUR floating-point coprocessor has been presented. The actual
control unit is divided into two main sections: interface control and datapath control. The basic
blocks needed for the interface control unit were designed first, taking about 15% of the total design
time; routing between the blocks required another 5% of the effort. The random logic implementation
comprised about 80% of the total design time. The disparity in design effort shown here is due to the
inherent lack of regularity in the datapath control logic, as opposed to the interface control, which
is based upon automatically-generated PLAs.

A major portion of the design of the random logic involved studying the alternative implementations
available and developing a structured approach to the layout. Initially, a lot of time was spent trying
to optimize each cell in terms of area and speed, before actually determining the placement of the
cells and their load capacitances. A better approach would have been to determine the area available
and then design the cells accordingly. Also, the individual speed optimization of each cell was not
required, and in fact was not even desirable. As discussed in this report, clock skew between the
control lines is minimized when the delays in all of the lines are balanced. Therefore, the best
approach would have been to first determine the longest control line delay in the datapath, and then
design each cell to closely match this delay.

In retrospect, I feel that I should have implemented the random logic first, rather than the interface
control. The reason for this is that in placing and simulating the random logic and datapath, I
obtained a much better feel for the layout style of the other team members, the design issues they
considered, how the control logic interfaces with the datapath, and so on. With this in mind, the
implementation of the interface control would have been much easier.

The rapid advances in VLSI technology have led to the proliferation of CAD tools to aid in the various
steps of chip development. A lot of tedious and error-prone functions, such as routing and PLA
generation, can now be performed automatically. However, no such tools were available to SPUR for
aiding in the design, generation, placement, and optimization of the random control logic. Much of the
design effort above could have been avoided if more sophisticated CAD tools had been available.

APPENDIX A: PLA DEFINITIONS


Table A-A1 Instruction PLA Input and Output Definitions

Type     Signal Name
INPUT    instr-OPCODE<6>
INPUT    instr-OPCODE<5>
INPUT    instr-OPCODE<4>
INPUT    instr-OPCODE<3>
INPUT    instr-OPCODE<2>
INPUT    instr-OPCODE<1>
INPUT    instr-OPCODE<0>
OUTPUT   ctrl-TrapRecvd
OUTPUT   instr-ldext1
OUTPUT   instr-ldext2
OUTPUT   instr-lddbl
OUTPUT   instr-ldsgl
OUTPUT   instr-stext1
OUTPUT   instr-stext2
OUTPUT   instr-stdbl
OUTPUT   instr-stsgl
OUTPUT   instr-addop
OUTPUT   instr-subop
OUTPUT   instr-mulop
OUTPUT   instr-divop
OUTPUT   instr-cvtsop
OUTPUT   instr-cvtdop
OUTPUT   instr-cmpop
OUTPUT   instr-fabsop
OUTPUT   instr-fnegop
OUTPUT   instr-fmovop
OUTPUT   instr-loadop
OUTPUT   instr-storeop
OUTPUT   instr-MD/AS
OUTPUT   instr-AS/MD
OUTPUT   instr-cvrtop
OUTPUT   instr-fpuArithop


Table A-A2 IFSM PLA Input and Output Definitions


Table A-A3 Arithmetic PLA Input and Output Definitions

Type     Signal Name
INPUT    arith-cycleclock<4>
INPUT    arith-cycleclock<3>
INPUT    arith-cycleclock<2>
INPUT    arith-cycleclock<1>
INPUT    arith-cycleclock<0>
INPUT    arith-subop
INPUT    arith-mulop
INPUT    arith-divop
INPUT    arith-fabsop
INPUT    arith-fnegop
INPUT    arith-fmovop
INPUT    arith-fcvrtop
INPUT    arith-MD/AS
INPUT    arith-AS/MD
INPUT    arith-arithop
INPUT    arith-opexcept-detect
INPUT    arith-sign-muldiv
INPUT    arith-expn-BgtA

APPENDIX B: SCHEMATICS


Figure A-B1 Detailed Diagram of the Main Control Unit

Figure A-B4 Static Latch with Asynchronous Clear


APPENDIX C: LAYOUT PLOTS


Figure A-C1 2and2or

Figure A-C2 2nand2nand

Figure A-C4 2nand3nand

Figure A-C6 2nand4nand

Figure A-C7 2and5or

Figure A-C8 2nand5nand

Figure A-C10 3nand2nand

Figure A-C11 2:1 Static Mux

Figure A-C12 Static Latch With Asynchronous Clear

Figure A-C13 Increment Bit

Figure A-C14 5-Bit Cycle Counter With cycleclock-init Logic

Figure A-C15 Load-Store Pipeline With Memory Control Logic

Figure A-C16 Exponent, Fraction, and Multiply-Divide Random Logic

Figure A-C17 Interface Control Unit

APPENDIX D: CONTROL SIGNAL DEFINITIONS AND IMPLEMENTATION

[Gate-level schematics; only the section titles are recoverable.]

MULTIPLY-DIVIDE CONTROL SIGNALS

MULTIPLIER BYTE SELECT CONTROL SIGNALS

EXPONENT CONTROL SIGNALS

FRACTION CONTROL SIGNALS

MISCELLANEOUS CONTROL SIGNALS

SIGN BOX CONTROL SIGNALS

MEMORY CONTROL SIGNALS: DECODE STAGE, SECOND STAGE, MEMORY STAGE, WRITE STAGE, MISCELLANEOUS SIGNALS

APPENDIX E: DATAPATH CAPACITANCES, BUFFER SIZES, AND DELAY TIMES


WORST CASE CONTROL INFORMATION FOR EXPONENT LAYOUT

BLOCK    CELL         SIGNAL (layout name)                 Cap (fF)  Buffer  Delay
?        ?            ctrl-read-regsB* (en*)               1549      40      3.38
?        ?            ctrl-read-regsB* (en*)               1549      40      3.32
?        ?            ctrl-write-arithresults* (en*)       1549      40      5.00
?        ?            PHI1 (phi)                           699       40      1.29
?        ?            PHI1* (phi*)                         939       40      0.57
?        ?            expn-BgtA* (BgtA*)                   869       40      10.47
?        ?            expn-BgtA (BgtA)                     852       40      9.58
?        ?            ce-latch-expndiff (phi)              515       40      5.20
?        ?            PHI1 (phi)                           699       40      1.29
?        ?            PHI1* (phi*)                         939       40      0.59
E3       EAlatch      ctrl-latch-ops (phi)                 854       40      5.06
E3       EAlatch      ctrl-latch-ops* (phi*)               905       40      4.13
E3       EBlatch      ce-latch-expnB (phi)                 854       40      6.23
E3       EBlatch      ce-latch-expnB* (phi*)               905       40      5.84
E3       MuxOPG       ce-mux-OPG (EB>EA)                   852       40      2.24
E3       MuxOPG       ce-mux-OPG* (EB>EA*)                 835       40      2.83
E3       latchOPG     ce-latch-opGL (phi)                  597       40      6.89
E3       latchOPG     ce-latch-opGL* (phi*)                734       40      6.16
E3       MuxOPL1      ctrl-MD+AS* (MD+AS*)                 858       40      1.24
E3       MuxOPL1      ctrl-MD+AS (MD+AS)                   814       40      0.46
E3       MuxOPL2      ctrl-firstcycle-passive* (1cyc*)     902       40      4.05
E3       MuxOPL2      ctrl-firstcycle-passive (1cyc)       835       40      3.18
E3       latchOPL     ce-latch-opGL (phi)                  597       40      6.18
E3       latchOPL     ce-latch-opGL* (phi*)                734       40      5.49
EGsubEL  SubXor17     ce-adder* (CEsub)                    1330      40      5.88
E4       MuxEdest1    ctrl-arithop (arith)                 852       40      4.13
         MuxEdest1    ctrl-arithop* (arith*)               835       40      3.26
         Mux17Edest2  frac-zerodet (0det)                  890       40      1.33
         Mux17Edest2  frac-zerodet* (0det*)                822       40      0.52
         MuxEdest3    ctrl-firstcycle-passive (1cyc)       852       40      4.05
         MuxEdest3    ctrl-firstcycle-passive* (1cyc*)     835       40      3.18
         Ereslatch    ce-latch-dest (phi)                  597       40      7.44
         Ereslatch    ce-latch-dest* (phi*)                734       40      1.96
         Eresbusdr    (ce-write-to-busB)* (en*)            1642      40      6.50

(? marks block and cell entries that cannot be aligned with their rows. The cell names extracted for
the first ten rows are, in order: EAbusdr, EBbusdr, ebusBinv, AsubB17, AbarB17, AsubB17, AbarB17,
MuxBgtA17, MuxBgtA17, Egel28, BsubA17, BsubA17; block E2 accompanies MuxBgtA17 and Egel28.)


WORST CASE CONTROL INFORMATION FOR FRACTION LAYOUT

CELL          SIGNAL (layout name)                 Cap (fF)  Buffer  Delay
?             ctrl-read-regsB* (en*)               5570      60      ?
?             ctrl-read-regsB* (en*)               5570      60      ?
?             ctrl-write-arithresults* (en*)       5570      60      ?
?             ctrl-latch-ops (phi)                 2947      60      ?
?             ctrl-latch-ops* (phi*)               3203      60      ?
?             ctrl-latch-ops (phi)                 2947      60      ?
?             ctrl-latch-ops* (phi*)               3203      60      ?
?             expn-shfA-frac (CeBgtA)              2851      60      ?
?             expn-shfA-frac* (CeBgtA*)            2819      60      ?
?             expn-shfA-frac (CeBgtA)              2851      60      ?
?             expn-shfA-frac* (CeBgtA*)            2819      60      ?
?             ctrl-AS/MD (AS+MD)                   2881      60      ?
?             ctrl-AS/MD* (AS+MD*)                 2945      60      ?
?             cf-latch-opAB (phi)                  2947      60      ?
?             cf-latch-opAB* (phi*)                3203      60      ?
?             cf-write-norm (en*)                  5197      60      ?
?             cf-latch-lshout (phi)                2497      60      ?
?             cf-latch-lshout* (phi*)              3075      60      ?
?             PHI1.2 (phi1.2)                      4161      60      ?
?             cf-latch-rshout (phi)                2146      60      5.85
?             cf-latch-rshout* (phi*)              3021      60      4.36
?             PHI2.1 (phi2.1)                      4223      60      2.07
?             cf-latch-lshin (phi)                 2242      60      3.16
?             cf-latch-lshin* (phi*)               2627      60      4.26
fmuxA1        ctrl-MDcycle-passive (MD)            3577      60      3.24
fmuxA1        ctrl-MDcycle-passive* (MD*)          3122      60      1.19
fmuxA2        ctrl-AS/MD (AS+MD)                   3252      60      1.74
fmuxA2        ctrl-AS/MD* (AS+MD*)                 3187      60      0.38
flatchOPA     cf-latch-opAB (phi)                  2277      60      7.92
flatchOPA     cf-latch-opAB* (phi*)                2799      60      6.35
FBxor         cf-adder* (sub)                      4481      60      10.10
fintlatch     cf-latch-intermed (phi)              1973      60      7.59
fintlatch     cf-latch-intermed* (phi*)            2588      60      6.17
fintlatdr     cm-latch-compmcd (en*)               5291      60      5.22
fcompl        frac-sign-intermed (Compl)           4559      60      1.59
fmuxincout    ctrl-firstcycle-passive (SELA)       3083      60      5.97
fmuxincout    ctrl-firstcycle-passive* (SELA*)     3083      60      4.28
flatincout64  cf-latch-incout (phi)                3087      60      7.04
flatincout64  cf-latch-incout* (phi*)              3355      60      5.55
fincoutdr     cf-latch-lshin (en*)                 5953      60      8.65

(? marks an entry that is illegible or missing in the source.)
WORST CASE CONTROL INFORMATION FOR MULTIPLY/DIVIDE LAYOUT

BLOCK  CTRL SIGNAL (layout name)         Cap (fF)  Buffer  Delay (ns)  Label

IN     cm-latch-compmcd (phi)                                          61,64
IN     cm-latch-compmcd* (phi*)                                        62,63
IN     ctrl-latch-ops (phi)                                            57,60,65,69,70
IN     ctrl-latch-ops* (phi*)                                          58,59,66,68
IN     cm-clear-div (clr)                                              67

MUX    cm-latch-slave (phi)                                24.23       39,44,49,54
MUX    Booth sel-1 (sel-1)                                 3.93        38,43,48,53
MUX    Booth sel+1 (sel+1)                                 3.87        40,45,50,55
MUX    Booth sel-2 (sel-2)                                 3.98        41,46,51,56
MUX    Booth sel+2 (sel+2)                                 3.93        37,42,47,52

LATCH  cm-latch-slave (phi)                                25.08       30,32,34,36
LATCH  cm-latch-slave* (phi*)                              28.84       29,31,33,35

CSA    cm-latch-master (phi)             11254             18.52       25,36,37,38

pp     cm-latch-master (phi)             3213              17.46       14,34
pp     cm-latch-master* (phi*)           3901              15.94       13,33
pp     cm-latch-slave (phi)              3213              23.62       8,16
pp     cm-latch-slave* (phi*)            3901              27.78       7,15
pp     cm-clear (clr)                    7429              38.72       12,32
pp     ctrl-gate-PPS/C-fracBus (en)      7410              8.72        11,31
pp     ctrl-mulop (shr8)                 3875              3.89        9,19
pp     ctrl-mulop* (shl2)                3943              1.53        10,30
pp     cm-mux-dvr (enbusA)               3936              8.24        17
pp     cm-mux-dvr* (enPPSm)              3888              7.44        18

QUO    cm-latch-master (MA)              3575              15.57
QUO    cm-latch-slave (SL)               2832              22.93
QUO    ctrl-gate-quotient-fracBus (en)   6500              8.07

Delay (ns) entries printed for the IN block, row assignment not recoverable: 10.41, 35.01
EXPONENT CONTROL SIGNAL PLACEMENT

[Figure: floorplan of the exponent datapath, sections E1 to E4, listing the datapath blocks and the control signals routed to them.]

Blocks: Efile, Eregfile, EAbusdr, EBbusdr, AbarB17, AsubB17, MuxBgtA17, BsubA17, AbarB, EAlatch, EBlatch, MuxOPG, latchOPG, Eshiftcvt, muxOPL1, muxOPL2, latchOPL, SubXor17, EGsubEL, muxEdest1, mux17Edest2, muxEdest3, Ereslatch, Eresbusdr.

Control signals: PHI2+PHI4, ctrl-read-regsB*, ctrl-write-arithresults*, PHI1, expn-BgtA (output), ctrl-latch-ops, ce-mux-opG, ce-latch-opGL, ctrl-MD+AS (input), ctrl-firstcycle-passive, ce-adder*, ctrl-arithop, frac-zerodet, ce-latch-dest, (ce-dest-to-busB + ctrl-write-arithresults)*.
FRACTION CONTROL SIGNAL PLACEMENT

[Figure: floorplan of the fraction datapath, sections F1 to F4, listing the datapath blocks and the control signals routed to them.]

Blocks: Ffile, FAbusdr, FBbusdr, FlatchA, FlatchB, muxFG, fmuxB, flatchOPB, lshfoutbusdr, flatlshfout, SHIFTER, flatchrshfout, flatchlshfin, fmuxA1, fmuxA2, flatchOPA, FBxor, adder66, fintlatch, fintlatdr, fcompl, inc64, fmuxincout, flatincout, fincoutdr, detector.

Control signals: PHI2+PHI4, ctrl-read-regsB*, ctrl-write-arithresults, ctrl-latch-ops, expn-shfA-frac, ctrl-AS+MD, cf-latch-opAB, cf-write-norm*, PHI1.2, cf-latch-lshout, cf-latch-rshout, PHI2.1, cf-latch-lshin, ctrl-MDcycle-passive, cf-adder*, f-carryin, cf-latch-intermed (output), cm-latch-compmcd* (output), PHI2, frac-sign-intermed (output), ctrl-firstcycle-passive, cf-latch-incout, PHI1*.
MULDIV CONTROL SIGNAL PLACEMENT

[Figure: floorplan of the multiply/divide datapath, listing the control signals routed to each block.]

Control signals: ctrl-gate-quotient-fracBus (input), cm-latch-master, cm-latch-slave, ctrl-mulop, ctrl-gate-PPS+C-fracBus, cm-clear, cm-mux-dvr, BOOTH, ctrl-latch-ops, cm-latch-compmcd, cm-clear-div, cm-bytesel-mpr<8:0>.
APPENDIX F: SIMULATION DATA


1. SPICE

1. Extract circuit description from within Magic: ":ext"

2. Obtain .sim file from ext2sim: "ext2sim -R -c 1e-18"

3. Remove any attributes from the .sim file, such as the CrIn$ labels used in CRYSTAL. In vi,
"%s/d=CrIn$//g", etc.

4. Obtain .spice file from sim2spice: "sim2spice -d ~/misc/def"

5. Add input and model cards to .spice file.

6. Submit .spice file on EROS: "spice < file.spice > file.rait"
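
Step 3 can also be done without an interactive editor. The following is a minimal sketch in Python, not part of the original flow. It assumes that attributes appear on transistor lines of the .sim file as whitespace-separated fields of the form g=..., s=..., or d=..., with multiple attributes separated by commas, and that the CRYSTAL attributes to be removed begin with "Cr". The function name and the sample lines are hypothetical.

```python
def strip_sim_attributes(lines, prefix="Cr"):
    """Remove terminal attributes whose name starts with `prefix`.

    Each line is split on whitespace; fields of the form g=..., s=...
    or d=... are treated as attribute lists (an assumption about the
    .sim format), and matching attributes are dropped.
    """
    cleaned = []
    for line in lines:
        fields = []
        for field in line.split():
            key, sep, value = field.partition("=")
            if sep and key in ("g", "s", "d"):
                kept = [a for a in value.split(",") if not a.startswith(prefix)]
                if not kept:
                    continue  # every attribute on this terminal was removed
                field = key + "=" + ",".join(kept)
            fields.append(field)
        cleaned.append(" ".join(fields))
    return cleaned


# Hypothetical transistor lines: type, gate, source, drain, l, w, x, y, attributes.
example = [
    "n phi1 in out 2 4 10 20 d=CrIn$",
    "p a Vdd out 2 8 10 40",
]
print(strip_sim_attributes(example))
# ['n phi1 in out 2 4 10 20', 'p a Vdd out 2 8 10 40']
```

The sketch rewrites whitespace between fields as single spaces, which is harmless if the downstream tool only requires whitespace-separated fields.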

Typical ~/misc/def file:

def  p  P  scmosn 
def  n  N  scmosll 
set  Vdd  1  scmosll 
set  GND  0  scmosn 
set P 1
set N 0


Model parameters used:

TYPICAL Device parameters for the HP CMOS40 Process

Released 2/6/86 by Rich Duncombe
NOTE: These parameters are intended for digital design only.

* Use N and P models for W >= 4U and L <= 2U

.MODEL N NMOS LEVEL=2 VTO=0.75 KP=76.0U GAMMA=.40 LAMBDA=.025 TOX=25N
+ NSUB=4E16 TPG=+1 XJ=.25U LD=.20U UEXP=.16 VMAX=5.5E4 JS=1000U
+ CGSO=220P CGDO=220P CJ=230U CJSW=260P CGBO=400P

.MODEL P PMOS LEVEL=2 VTO=-0.75 KP=27.0U GAMMA=.50 LAMBDA=.045 TOX=25N
+ NSUB=2.0E16 TPG=-1 XJ=.20U LD=.05U UEXP=.15 VMAX=9.0E4 JS=1000U
+ CGSO=220P CGDO=220P CJ=670U CJSW=215P CGBO=400P
2. CRYSTAL

1. Extract circuit description from within Magic: ":ext"

2. Obtain .sim file from ext2sim: "ext2sim -R -c 1e-18"

3. Start up CRYSTAL: "crystal"

4. Read in typical source file: "source file.crystal"

Typical source file (for exp-control.sim):

source crys_paiin
build exp-control.sim
alias exp-control.al

inputs *PHI *cycle *clock *ctrl-start *ctrl-AS *ctrl-mulop
outputs <1:35>
watch 1 2

options graphics magic
options bus 12
options watchpaths 10
capacitance 1.6 1
capacitance 1.6 2
set 0 PHI3&
set 0 PHI2&
set 0 PHI4&
delay PHI1& 0 0
critical -g gatequojncrit
critical Iw


CRYSTAL parameters used:


! crystal parameter release V2.1 (2.10.86)

! based on HPCMOS40 1.6um Process
! extracted by Wook Koh mail problems to

tran nchan slopeparmsdown 0.000, 8000, 0.7; 0.113, 8500, 0.7; 0.271, 9500, 0.7; 0.771, 11500, 0.7; 2.527, 15500, 1.0; 7.534, 2(
tran nchan slopeparmsup 0.000, 8000, 0.7; 0.113, 8500, 0.7; 0.271, 9500, 0.7; 0.771, 11500, 0.7; 2.527, 15500, 1.0; 7.534,
tran pchan slopeparmsdown 0.000, 20000, 0.8; 0.488, 25000, 0.9; 1.599, 35000, 1.0; ^000 1.5; ^
tran pchan slopeparmsup 0.000, 20000, 0.8; 0.488, 25000, 0.9; 1.599, 35000, 1.0; 4.800, 55000, 1.5; 47.828, 194000, 5.0,
3. MOSSIM

1. Extract circuit description from within Magic: ":ext"

2. Obtain .sim file from ext2sim: "ext2sim -R -c 1e-18"

3. Obtain .ntk file from sim2ntk: "sim2ntk file"

4. Start up MOSSIM: "Mossim"

5. Read in .ntk description: "read file"

6. Read in typical source file: "source file"

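sim2ntk file:

#!/bin/csh -f
# Convert file.sim into file.ntk by driving the "convert" program
# with a temporary command file.

if ($#argv != 1) then
    echo "Usage: sim2ntk file"
    exit 1
endif

# On interrupt, jump to the cleanup label.
onintr end

set file=$1.sim

# Remove any previous network file.
rm -r $1.ntk

# Build the command file for convert.
echo sim $1.sim >! temp.$$
echo type vdd:i gnd:i >> temp.$$
echo stren ratio:3.0 >> temp.$$
echo size ratio:3.0 >> temp.$$
echo write $1.ntk >> temp.$$
echo quit >> temp.$$
convert < temp.$$

end:
/bin/rm -f temp.$$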
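
The clock statement in the counter.src listing that follows gives each phase an 8-step bit pattern, one character per time step. The sketch below, in Python, copies those patterns and checks two properties one would expect of them: no two true phases are high in the same step, and each starred signal is the bitwise inverse of its phase. The function names are hypothetical and the check is an illustration, not part of the MOSSIM flow.

```python
# Phase patterns copied from the clock statement in counter.src.
CLOCK = {
    "phi1": "10000000", "phi1*": "01111111",
    "phi2": "00100000", "phi2*": "11011111",
    "phi3": "00001000", "phi3*": "11110111",
    "phi4": "00000010",
}


def overlapping_phases(clock):
    """Return pairs of true (unstarred) phases that are high in the same step."""
    true_phases = {n: p for n, p in clock.items() if not n.endswith("*")}
    names = sorted(true_phases)
    clashes = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            pairs = zip(true_phases[a], true_phases[b])
            if any(x == "1" and y == "1" for x, y in pairs):
                clashes.append((a, b))
    return clashes


def bad_complements(clock):
    """Return starred signals that are not the inverse of their phase."""
    bad = []
    for name, pattern in clock.items():
        if name.endswith("*"):
            inverse = "".join("1" if c == "0" else "0" for c in clock[name[:-1]])
            if pattern != inverse:
                bad.append(name)
    return bad


print(overlapping_phases(CLOCK))  # []
print(bad_complements(CLOCK))     # []
```

With the patterns as listed, each phase is high for one step and separated from the next by one idle step, so both checks return empty lists.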


Typical source file (counter.src):


copy counter.cpy
comment
comment Logical Simulation for 5-bit Counter
comment
comment
switch explain:!

clock phi1:10000000 phi1*:01111111 phi2:00100000 phi2*:11011111 -
      phi3:00001000 phi3*:11110111 phi4:00000010

force Vdd:1 GND:0
vector clear* (clr&phi3)*
prefix 5bitinc_0
vector A incbit_3/ai incbit_2/ai incbit_1/ai incbit_0/ai incbit0_0/ai
unprefix

vector /b S s4 s3 s2 s1 s0
vector /b SPHI1 s4_phi1 s3_phi1 s2_phi1 s1_phi1 s0_phi1
watch /* /b 5bitinc_0/A S SPHI1 clear* start busy inc cout
set /b 5bitinc_0/A:00000
set /b clear*:0 start:0 busy:0


